On October 7, 2023, at 3:00pm PST, the StaffConnect servers were taken offline for a planned 24-hour outage to resolve data state separation and replication failures. This period was extended by 10 hours. This explanation focuses on providing a more technical insight into the issues, the technical debt behind them, and the resolution phases undertaken in response to the October 7th outage.
The issue is best explained in three parts:
StaffConnect’s development history is long and quite mature. The product was first built by an independent developer back in 2011 (the earliest records we have found). While StaffConnect kept its technical stack current by upgrading frameworks to more recent versions in 2019/2020, one core component is often overlooked during upgrades: the underlying infrastructure of the service itself. This can be seen in a few key areas worth mentioning:
StaffConnect operates two primary virtual machines, located in the USA and Australia. A virtual machine is best understood as a computer that does not exist as its own physical machine, but instead runs as a software-defined slice of a provider’s physical servers in a data centre. We then access the service over the internet via a secure connection. Behind the scenes, when a user connects to StaffConnect they are routed to either our service in the USA or the one in Australia, whichever is closer based on their IP address.
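As a rough illustration only (the country-to-region map and the lookup below are hypothetical, not StaffConnect’s actual routing code), IP-based routing boils down to resolving a request’s source IP to a location and then picking the nearer region:

```python
# Hypothetical sketch of IP-based region routing; the country-to-region map and
# the default fallback are illustrative assumptions, not production configuration.
REGION_BY_COUNTRY = {
    "US": "usa",
    "CA": "usa",
    "AU": "australia",
    "NZ": "australia",
}

def pick_region(country_code: str, default: str = "usa") -> str:
    """Return the service region a user should be routed to."""
    return REGION_BY_COUNTRY.get(country_code.upper(), default)

# Example: a request whose IP geolocates to Australia lands on the AU service.
print(pick_region("AU"))  # -> "australia"
print(pick_region("GB"))  # -> falls back to "usa"
```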
Enabling this user routing involves a great deal of logistical overhead, despite it seeming like a straightforward concept. Because of it, users need seamless, near-zero replication lag between the services in the USA and Australia, both so that collaborative work does not overlap and so that what one user sees is identical to what users connected to the other region see. If a user in the USA updated payroll and that change was not transmitted to Australia quickly enough before another user actioned the same payroll item, the systems would not collide and raise an error; they would simply allow that payroll item to be actioned twice, causing a payment to be double charged. Even that small example illustrates how critical replication latency is to operations, and how IP-based routing works against our ability to support user collaboration.
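One common safeguard against this kind of double action is a conditional update that can only succeed once, no matter how many users attempt it. The sketch below is purely illustrative (the payroll_items table, its status column, and the in-memory database are assumptions, not our schema), and note that a guard like this only helps when both regions write to the same authoritative copy of the row, which is exactly why low replication lag matters:

```python
# Hypothetical sketch: guard a payroll action with a conditional update so that
# only the first attempt succeeds, even if two users action it at nearly the
# same time. Table and column names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll_items (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO payroll_items (id, status) VALUES (1, 'pending')")
conn.commit()

def action_payroll_item(item_id: int) -> bool:
    """Mark the item as paid only if it is still pending; True means we acted first."""
    cur = conn.execute(
        "UPDATE payroll_items SET status = 'paid' WHERE id = ? AND status = 'pending'",
        (item_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 means it was already actioned elsewhere

print(action_payroll_item(1))  # True  (first action goes through)
print(action_payroll_item(1))  # False (second attempt is rejected)
```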
Replication is another tricky thing to get right. StaffConnect employs what is known as an InnoDB Cluster, a replication technology native to MySQL. As users input, update, or delete data within the application, it produces a transaction log of all events over time. Replication in StaffConnect works by having the server in the USA provide its transaction log to Australia and signal it to apply everything in order, from the top of the log to the bottom (and vice-versa from Australia to the USA). These transaction logs, known as binary log events, play a pivotal role in disaster recovery because they allow ‘point-in-time’ recovery: we can fall back to a particular position in the database’s history, although it is imperative to choose a position at which data operations are at their most quiescent for the rollback to succeed. InnoDB Clusters are a wonderful approach to replication; however, they scale vertically, meaning storage must be expanded on every server at the same time because they all hold identical copies of the data (in contrast to the more modern approach of horizontal scaling, where data is partitioned across a series of smaller datastores and only the areas needing more storage are expanded). InnoDB Clusters are also best run on a local network, with all servers on the same network segment. Because of how the data is transported, and because of the need for minimal replication lag (as illustrated in the payroll example above), we want every machine to receive the transaction log messages as quickly as possible so that each server is at the same point in time as often as possible.
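For a sense of how this is observed in practice, the sketch below shows how replication lag and the binary logs that enable point-in-time recovery can be read from a MySQL replica. It is a hedged illustration: the host and credentials are placeholders, the exact column names vary by MySQL version (SHOW REPLICA STATUS on 8.0.22+, SHOW SLAVE STATUS on older releases), and an InnoDB Cluster additionally exposes group state through performance_schema tables.

```python
# Illustrative sketch only: host and credentials are placeholders, and the
# statements/columns shown apply to recent MySQL 8.0 versions.
import mysql.connector

conn = mysql.connector.connect(
    host="replica.example.internal",  # placeholder, not a real StaffConnect host
    user="monitor",
    password="...",
)
cur = conn.cursor(dictionary=True)

# How far behind the source this replica currently is, in seconds.
cur.execute("SHOW REPLICA STATUS")
status = cur.fetchone()
if status:
    print("Replication lag (s):", status.get("Seconds_Behind_Source"))

# The binary logs that make point-in-time recovery possible.
cur.execute("SHOW BINARY LOGS")
for row in cur.fetchall():
    print(row["Log_name"], row["File_size"])

cur.close()
conn.close()
```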
Lastly, there is the service structure of StaffConnect. As touched on above, an InnoDB Cluster becomes expensive to scale because data is written not just once, but as many times as the cluster has members. It is a double-edged sword, though, because the larger the cluster, the safer and more fault tolerant your operations are. StaffConnect, however, had historically operated with the disadvantages of both scenarios: running only two servers within a cluster is disadvantageous for reasons illustrated by the current outage. Per the MySQL documentation, three is the advisable minimum number of server instances for an InnoDB Cluster to retain its benefits. Because of this, StaffConnect historically had very little fault tolerance, as there was never a ‘witness’ server, one not carrying live traffic, that could step in to replace a downed instance.
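A short way to see why three is the advisable minimum: a replication group can only keep operating while a majority of its members are reachable, so the number of failures it can tolerate is floor((n - 1) / 2). The snippet below just works through that arithmetic.

```python
# Majority-quorum arithmetic: with n members, a group keeps operating only while
# a majority is reachable, so it tolerates floor((n - 1) / 2) member failures.
def tolerated_failures(members: int) -> int:
    return (members - 1) // 2

for n in (1, 2, 3, 5):
    print(f"{n} member(s): survives {tolerated_failures(n)} failure(s)")

# Output:
# 1 member(s): survives 0 failure(s)
# 2 member(s): survives 0 failure(s)   <- a two-node cluster gains no fault tolerance
# 3 member(s): survives 1 failure(s)
# 5 member(s): survives 2 failure(s)
```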
Now that we understand the background and limitations of StaffConnect’s original design, we can review the issue that occurred, which had its roots back in July of 2023. In July of this year, you may remember, there was an email service outage. It lasted 24 hours and impacted an astounding 1.8 million attempted email sends. This was an extraordinarily large amount compared to usual daily volumes, and upon inspection it was clear that something had malfunctioned within our Australian service. A feedback loop had developed somewhere, resulting in the same email being queued for sending over 800,000 times. Not only did this take down the email service, it caused our Australian service to hit maximum storage capacity. While we managed to clear the issue and resolve the capacity problem, what happened in the background was that an automated backup and the binary transaction log silently pushed our Australian service back to full capacity. This went undetected by our monitoring systems, because they watch data flowing in and out of the database, not data written to the virtual machine’s local storage. This silent capacity creep resulted in the server seizing, failing, and having to be placed into a rescue state.
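This is exactly the kind of gap a simple host-level disk check would have caught. The sketch below is illustrative only: the mount point and alert threshold are assumptions, not our actual monitoring configuration.

```python
# Minimal sketch of a host-level disk usage check, the kind of monitoring that
# would catch backups and binary logs silently filling the VM's local storage.
# The mount point and alert threshold below are illustrative assumptions.
import shutil

MOUNT_POINT = "/var/lib/mysql"   # where the data, backups, and binlogs live
ALERT_THRESHOLD = 0.85           # warn once 85% of the volume is used

usage = shutil.disk_usage(MOUNT_POINT)
used_fraction = usage.used / usage.total

print(f"{MOUNT_POINT}: {used_fraction:.0%} used "
      f"({usage.used / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB)")

if used_fraction >= ALERT_THRESHOLD:
    # In a real setup this would page an engineer or trigger log/backup rotation.
    print("ALERT: local storage is approaching capacity")
```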
As previously mentioned, a properly sized InnoDB Cluster would have had the fault tolerance to route Australian traffic to a spare database instance, giving us time to revive the downed server. Unfortunately for us, Australian traffic was instead routed to the only live service, the USA. This made load times incredibly long and sometimes caused requests to time out; but more detrimental to StaffConnect’s eventual recovery was the loss of replication for the entire time the Australian server was down.
We were able to revive the Australian server in mid-August, after having to rebuild a large portion of the system. However, once it was reinitialized, we noticed that replication could only bring in data from as far back as three days. This meant we no longer had the transaction logs that are imperative for a point-in-time recovery, which left us with one option: replicate the entire system from one service (USA) to the other (Australia).
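A ceiling like “three days” typically comes from binary log retention settings, which cap how far back point-in-time recovery can reach. As a hedged sketch (host and credentials are placeholders; the variable name applies to MySQL 8.0, where older versions used expire_logs_days instead), the current retention window can be checked like this:

```python
# Illustrative check of how long binary logs are retained before being purged;
# host and credentials are placeholders. On MySQL 8.0 the setting is
# binlog_expire_logs_seconds (older versions used expire_logs_days).
import mysql.connector

conn = mysql.connector.connect(host="db.example.internal", user="monitor", password="...")
cur = conn.cursor()

cur.execute("SHOW VARIABLES LIKE 'binlog_expire_logs_seconds'")
row = cur.fetchone()
if row:
    name, value = row
    print(f"{name} = {value} ({int(value) / 86400:.1f} days of point-in-time history)")

cur.close()
conn.close()
```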
To avoid another StaffConnect outage caused by the InnoDB Cluster issue described above, a third server was provisioned in the Eastern USA.
While restoring the third user’s database, the Australian service stopped with a technical error: we had a corrupted database somewhere within the system. However, the current architecture takes daily backups, so we were able to restore it back to normal, as though nothing ever happened, for every database except the primary administrative database that houses all clients’ domains and the access tokens for their websites.
This was the outage that occurred on September 26th, when the administrative database became corrupted on our American server and, courtesy of InnoDB replication, on our Australian server as well. That outage lasted two hours while we used the known-good state on our newly provisioned third database to repair the corruption; additional time was spent building safeguards around the issue.
The Australian service did not recover as elegantly, however: the authorization tokens on the Eastern USA (EUSA) server were copies of those on the Western USA (WUSA) server, meaning users could not connect to their applications when routed through Australia. Australian services were once again paused, this time for a 48-hour period, bringing back the increased traffic load on the WUSA server.
Once the Australian servers resumed service, data was once again missing, because traffic had been redirected back to the WUSA server in the meantime, undoing the replication work that had just been completed. At that point we decided to enact an emergency outage period lasting 24 hours. The goals of this outage were to reconcile data between WUSA and Australia, to reset the point in time from which the transaction log starts tracking once the service resumed, and to ensure that all three servers (Australia/WUSA/EUSA) maintain an identical state as often as possible.
The outage was planned for October 7-8, 2023, as a 24-hour period starting at 3:00pm PST; it was then extended by 10 hours. The goal of the outage was to accomplish the following:
The method of approach is described in the phases below:
We originally intended to attempt recovery of the data lost between Sept 30 and Oct 1; however, upon pulling and replaying the binary log events, we found that this produced a large number of duplicated records, some with possible financial repercussions. Because every one of our clients has their own processes and data structures, the most prudent course of action was not to proceed system-wide. We will be in contact with clients who have noted missing data in this period to discuss the pros and cons of this form of recovery.
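For context on what “pulling the binary log events” for that window looks like in practice, MySQL’s mysqlbinlog tool can extract just the events between two timestamps so they can be audited before anything is replayed. This is a hedged sketch: the binary log file name and the output path are placeholders, not our actual files.

```python
# Hedged sketch: extract binary log events for a specific time window so they
# can be inspected before any replay. The binlog file name and output path are
# placeholder assumptions; --start-datetime/--stop-datetime are standard
# options of the mysqlbinlog client tool.
import subprocess

cmd = [
    "mysqlbinlog",
    "--start-datetime=2023-09-30 00:00:00",
    "--stop-datetime=2023-10-01 23:59:59",
    "binlog.000123",  # placeholder binary log file
]

with open("window_2023-09-30_to_10-01.sql", "w") as out:
    subprocess.run(cmd, stdout=out, check=True)
# The resulting SQL can be reviewed for duplicates before deciding whether to apply it.
```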
As for the future resolution: as alluded to in my note to agency owners, StaffConnect has enlisted a well-established agency to assist in redeveloping StaffConnect’s infrastructure to meet its current and future needs. This will be a transformative change for StaffConnect and all our clients. By bringing our services into the cloud, we gain access to technologies that have redefined what highly available data and services look like. We will move away from routing users to whichever server is geographically closest and instead allow clients to nominate their primary region of operation (WUSA/EUSA/Australia), where their data will be hosted. Each of these three locations will have its own internal replication to ensure each locality is maintained in a fault-tolerant state. Additionally, the service providers we are working with will provide access to all StaffConnect services at a 99.95% to 99.99% monitored uptime, with access to a team of over forty engineers 24/7. This move is estimated to resolve 90% of current StaffConnect outages while increasing performance by 30%. The migration plan is presently in development, with the goal of building the new platform in parallel with current services and changing hands only once we have extensively evaluated and verified its stability. The migration is slated to be complete before the end of this year.
This plan is directly in line with our vision for StaffConnect’s future and our commitment to bringing you a new and improved service. Please be on the lookout for an announcement following this one, where I dive into the product design methodology of the much-anticipated StaffConnect 2.0.
I want to take this moment to express my deepest apologies for the disruption this has caused your organizations, and to reassure you that our focus is on providing a more stable, dependable, and feature-rich application. I also want to express my sincerest gratitude for everyone’s patience while we work day and night to resolve 12 years of technical debt. Through the data reconciliation process we reconciled over 300 databases, each consisting of 158 tables and holding an average of 500,000 rows of data, a total of roughly 150,000,000 rows, within a 34-hour period to resume your services.
Sincerely,
Jason Kwong
Chief Technology Officer