
For the first time since we launched, our API server has been offline due to the extreme upheaval in Los Angeles caused by 5 major fires across the area.


We hope to be back online mid-week.


Our hearts go out to everyone affected.


1/15: Our API is now back online.

 
 
 


This last month we upgraded from MongoDB 3.6 to MongoDB 8.0 for our API engine.


Keeping up to date on core software is essential for any long-lived, public-facing API, both for security and for the performance improvements that come along with it.


Since direct database access via pymongo was integral to our backend - 40+ usages in 15+ files - it was a software dependency with 'deep roots'. This wasn't something that could safely be done in one day.

Our plan and implementation steps were:

  • Install MongoDB V8.0.3

  • Create a new test branch in our code repo for MongoDBV8.

  • Add a layer of abstraction for which MongoDB version was running in which subdomain (dev | stage | live).

  • Ensure multiple versions of MongoDB could be running simultaneously on different target directories and ports in our API software stack. There were new parameters for running V8 more safely alongside a V3 version of MongoDB on the same system, a temporary situation that is usually avoided but frequently required for data migrations.

  • Implement parallel MongoDB instances based on the above changes in our test domain.

  • Develop a migrator Python script to pull all records out of the old 3.x database and into 8.0 (a minimal sketch follows after this list).

  • Run the migrator in our test area.

  • Review the data in the new 8.0 database, ensuring the integrity of the migration process.

  • Run the test-area version of our API on the 8.0 version, and run both automated and manual tests on it. Also do deep-dive integrity checks on our automated database backup framework.

  • Remove the older 3.x version of MongoDB from our stack framework in the test area.

  • Repeat the above testing on our stage release area.

  • Decide on a good time to implement the migration on our live API, and implement the change in as little time as possible. We crossed over in a few minutes.

  • Re-run all tests to make sure there were no surprises.

  • Sweep through and remove any temporary non-versioned MongoDB access code that was required during migration to the new MongoDB version abstraction layer.

  • Merge our MongoDB8 code changes from the MongoDB8 branch into our main branch, then re-release and re-test.

  • Delete our old backup archives based on MongoDB v3

  • Verify our ongoing backup processes are working as expected.

  • Repeat a much simpler version of the above for upgrading pymongo, our Python module bridge to MongoDB.
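
As an illustration of the migrator step above, here is a minimal sketch of what such a script could look like, using pymongo. The connection URIs, database name, and batch size are hypothetical placeholders for this example, and details a real migration also has to cover (index recreation, per-collection spot checks, and so on) are omitted.

# migrate_v3_to_v8.py - minimal sketch of a copy-everything migrator.
# URIs, database name, and batch size below are illustrative assumptions.
from pymongo import MongoClient

SOURCE_URI = "mongodb://localhost:27017"   # old MongoDB 3.x instance
TARGET_URI = "mongodb://localhost:28017"   # new MongoDB 8.0 instance on its own port
DB_NAME = "api_db"                         # hypothetical database name
BATCH_SIZE = 1000

def migrate():
    source = MongoClient(SOURCE_URI)[DB_NAME]
    target = MongoClient(TARGET_URI)[DB_NAME]

    for name in source.list_collection_names():
        src_coll = source[name]
        dst_coll = target[name]

        # Stream every document out of the old instance and insert it
        # into the new one in modest batches.
        batch = []
        for doc in src_coll.find({}):
            batch.append(doc)
            if len(batch) >= BATCH_SIZE:
                dst_coll.insert_many(batch)
                batch = []
        if batch:
            dst_coll.insert_many(batch)

        # Basic integrity check: record counts should match.
        assert src_coll.count_documents({}) == dst_coll.count_documents({}), name
        print(f"migrated {name}: {dst_coll.count_documents({})} documents")

if __name__ == "__main__":
    migrate()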


In the end, our migration took about 2 weeks, just doing one or two steps every day or so. Taking it slow ensured there were no big surprises or panics arising from unexpected migration code or MongoDB behavior. One usually hears much focus on software development practices that ensure rapid release capability. However, making substantive changes to core supporting apps with a large surface area of interaction throughout the code base is a different challenge. By doing it slowly over time, with the freedom to check and verify that the data was flowing as expected all the way through the migration, we made this a happily invisible upgrade for our clients. (Other than the benefits of software currency and the speed improvements.)

 
 
 


This entry will be a bit tech heavy, but some of our customers enjoy hearing of our technical progress, even if it is behind the scenes. And yes, it’s good for everyone to know we’re always working here. 


Our backend software stack - which supports our API - consists of around 50 independent processes. Some of these processes are clones; duplicates of a program running at the same time. This pattern of running duplicates improves the resiliency of our API; if any one process has a problem or crashes, the others remain up and operating and handling any animation creation work that needs to be done. The administrator receives a notification of the crashed process, and debugging can start right away. A replacement process is also started, but there’s never any interruption in service, since multiple other processes are already integrated and actively picking up work.


The throughput of the API stack is increased as well, since multiple processes can work in parallel on different API requests. Scalability arrives too: if a larger volume of API calls needs to be handled, we simply increase the number of cloned processes. 
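
As a rough illustration of the clone pattern (not our actual process-management framework; the names here are illustrative), launching a group of identical worker processes can be as simple as:

# Sketch: start N identical worker clones. Scaling up or down is just a
# matter of changing clone_count.
import multiprocessing

def worker(clone_index: int) -> None:
    # Each clone would run the work-queue loop described further below,
    # repeatedly asking the queue for animation work to do.
    print(f"clone {clone_index} started")

if __name__ == "__main__":
    clone_count = 4
    clones = [multiprocessing.Process(target=worker, args=(i,)) for i in range(clone_count)]
    for p in clones:
        p.start()
    for p in clones:
        p.join()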


This software design pattern is known as Competing Consumers, and here is a good summary article.


Our stack has always had a competing-consumer design component, which works well, but in the past it was not optimized for time separation of work requests. Each process repeatedly checks the work queue database for any available work. There must be a time delay between each work request; otherwise there's too much overhead and stress on both the network and the CPU, because each process would be 'hammering' the system asking for work.


So, given that there must be a time delay between requests, an initial solution as simple as a constant-length sleep statement between work requests can serve to balance the need for:


  • a low system load, and also

  • a rapid turnaround for processing work requests.
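
As a minimal sketch (the queue-access functions here are hypothetical placeholders, not our actual internals), the constant-sleep version of a worker loop looks something like this:

import time

SLEEP_SECONDS = 1.0  # hard-coded delay between work requests

def fetch_next_job():
    # Hypothetical placeholder: query the work-queue database for one job.
    return None

def process_job(job):
    # Hypothetical placeholder: perform the requested work.
    pass

def worker_loop():
    # Each cloned process runs this loop, competing for work from the shared queue.
    while True:
        job = fetch_next_job()
        if job is not None:
            process_job(job)
        time.sleep(SLEEP_SECONDS)  # the constant pause keeps load on the queue server low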


This fixed sleep time approach works, but it also has a negative impact when scaling is needed and the running count of clones is increased. 


Let’s examine the effects of a simple case. Say initially there are 2 processes competing for work, hard coded to sleep 1 second between work requests. 


The queue server will be getting an average of around 2 requests per second.


These 2 processes could be started exactly 0.5 seconds apart, so that their requests alternate and arrive at the queue server evenly spaced, minimizing stress on the queue server while keeping the wait before any work item is picked up as short as possible.


When perfectly timed and spaced, the queue server would receive one work request every 0.5 seconds, one from each process, with each process repeating its work query to the database every 1 second.


But we know that in the real world computer processing isn't perfect. There will be a constant time-skewing of requests between calls, driven by natural random delays on the system and by the randomness of how much time each work request takes. This results in 'request pile-ups', where both processes could eventually be requesting work at the same time, inefficiently hammering the queue server in bursts and also increasing the maximum wait time to almost 1 second.


So let's ignore the 'time skew' request pile-up problem for just a moment, and focus on the simpler problem: a hard-coded sleep statement's effect on scaling.


If we scale up the process count by adding 8 processes for a total of 10, each still sleeping a hard-coded 1 second between work requests, then the queue server will have to handle 5x as many queue requests.


This is reasonably solved by calculating the sleep length as a function of the number of processes running, so the more processes that are running, the longer each process waits between calls. 
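
In sketch form (the names are illustrative), that adjustment is just:

# Scale the per-process sleep by the number of running clones so the aggregate
# request rate seen by the queue server stays roughly constant.
DESIRED_QUEUE_INTERVAL = 0.5  # target seconds between requests arriving at the queue server

def sleep_length(num_processes: int) -> float:
    return DESIRED_QUEUE_INTERVAL * num_processes

# e.g. 2 processes -> each sleeps 1.0 s; 10 processes -> each sleeps 5.0 s,
# and the queue server still sees roughly one request every 0.5 seconds.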


Now imagine 10 processes with the above sleep adjustment, each sleeping 5 seconds before querying for work, with the queue server getting a request every 0.5 seconds.


… but the skewing-over-time issue remains. With 10 processes running, they could frequently skew into a pattern where they are all hammering the queue server at the same time, with a gap of as much as almost 5 seconds before any specific work request gets picked up from the queue server.


Not optimal!


We decided on a solution where we declare a desired time separation between requests arriving at the queue server. Say we declare that the queue server should be queried once every 0.25 seconds. Remember, the goal here is to go lightweight on the queue server and light on the network, but still get work done in a reasonable amount of time.


If we know we want the queue server to be hit every 0.25 seconds, and we also know:


  • the number of clones,

  • the index of the clone making the calculation,

  • a standard offset for each process group type, and

  • the current time when the process has completed work or is asking if there is work,


then we can calculate a specific time to sleep such that the next request falls on a specific time slot within a one-minute period. Each process thus separates its calls to the queue server in a consistent manner, with no time skew, even if the time needed for work requests varies significantly.
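
Here is a minimal sketch of that calculation. The function name, parameters, and the assumption that the slot cycle divides evenly into the one-minute period are illustrative choices, not our exact implementation:

import time
from typing import Optional

def seconds_until_my_slot(queue_interval: float,
                          num_clones: int,
                          clone_index: int,
                          group_offset: float = 0.0,
                          now: Optional[float] = None) -> float:
    # Time to sleep so that this clone's next request lands on its own slot.
    # Slots are anchored to the start of each one-minute period, so timing
    # cannot skew over time no matter how long the previous job took.
    # (Assumes queue_interval * num_clones divides evenly into 60 seconds.)
    if now is None:
        now = time.time()
    cycle = queue_interval * num_clones                # one full rotation of all clones
    my_offset = (clone_index * queue_interval + group_offset) % cycle
    position = now % 60.0                              # where we are in the current minute
    slots_elapsed = (position - my_offset) // cycle    # completed cycles since this clone's first slot
    next_slot = my_offset + (slots_elapsed + 1) * cycle
    return next_slot - position

# Example: 10 clones, queue server queried once every 0.25 seconds overall.
# Clone 3 sleeps until its own slot, which recurs every 2.5 seconds.
time.sleep(seconds_until_my_slot(queue_interval=0.25, num_clones=10, clone_index=3))

Because the sleep is recomputed from the wall clock each time, a slow or fast piece of work simply changes how long the next sleep is; the request still lands on the clone's own slot rather than drifting toward its neighbors'.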


Rather than our sleep statements being hard coded, they are dynamically determined, adjusting for system time skew, the variable time needed to get work requests done, the number of clones running, and the type of clone running.


This was a fun implementation to make, and we saw the desired effect: much more consistent, lower CPU/network utilization, and much more consistent, reduced API request completion times.


Frequently in software engineering it's best to 'get things working correctly' first, and then circle back later and optimize. It depends on the project's needs, of course, but there are definitely times when too much focus is put on coming up with perfectly optimized code first, only delaying getting something useful into the user's hands. It's always particularly satisfying to implement a significant performance optimization on a system that has been working for years!


 
 
 