When the staging environment differs from production

We often set up staging as a mirror of production, where we can run QA and load tests to make sure the code is “production ready” before we tag it and push it out to production.

But it is often not that straightforward. There are times when staging does not behave like production and we ask ourselves why. For example, certain API calls or web pages seem to take much longer to respond in production than in staging, even though the app code is the same.

Often it is a case of data mismatch. You may have only a few hundred users in the staging environment while production is already in the league of millions of users. It could also be that production data of a certain type (e.g. text) is much larger than in staging, or uses symbols or characters that hinder server performance.

I ran into this situation when I was asked to optimise an API call and could not reproduce the production behaviour in the staging environment. This is what I did:

  1. Grab the latest copy of the production database and restore it into staging.
  2. Take note of the usernames of users who log in frequently in production.
  3. Use their IDs to run a load test with locust.io (you may need to make some changes to the authentication module in staging so the test can log in as them).
  4. Tweak the weighting of each API / web URL in locust.io until staging behaves like production. Also match the RPS (requests per second) load.
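
As a starting point for step 4, here is a minimal sketch of how the weights and the user count could be derived from production traffic. The endpoint paths and hit counts are made up for illustration, and both helper functions are my own assumptions, not part of locust.io. The resulting integers are the kind of value you would pass to Locust's `@task(weight)` decorator.

```python
import math

# Hypothetical hourly request counts per endpoint, e.g. taken from
# production access logs. Paths and numbers are illustrative only.
PRODUCTION_HITS = {
    "/api/feed": 54000,
    "/api/profile": 18000,
    "/api/search": 9000,
}


def task_weights(hits):
    """Scale raw counts so the least-used endpoint gets weight 1."""
    smallest = min(hits.values())
    return {path: max(1, round(count / smallest)) for path, count in hits.items()}


def users_needed(target_rps, avg_wait_s, avg_response_s):
    """Rough number of simulated users needed to reach target_rps.

    Assumes each user sends about one request every
    (wait time + response time) seconds.
    """
    return math.ceil(target_rps * (avg_wait_s + avg_response_s))


weights = task_weights(PRODUCTION_HITS)
production_rps = sum(PRODUCTION_HITS.values()) / 3600  # hits per hour -> per second

print(weights)
print(production_rps, users_needed(production_rps, avg_wait_s=1.5, avg_response_s=0.5))
```

Treat the numbers as a first guess only. You will still need to tweak them while watching the staging metrics, because real users do not spread their requests evenly.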

Most of the time, you don’t need to match the staging instance type to the production one. The staging instance can be slightly weaker than production. Once staging reproduces the production behaviour, you will be able to discern what to optimise. Some common findings are:

  1. Missing indexes on the columns that matter
  2. Lack of cleanup of tables that log data
  3. Data that may need to be denormalised
  4. Commonly used data that could be placed in memcached
  5. Sections of the code that do string rendering and need optimising
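
To illustrate the first item, here is a small sketch that checks whether a lookup uses an index. It uses Python's built-in `sqlite3` module purely as a stand-in for whatever database you run. The `users` table, the `email` column and the index name are hypothetical, and the exact wording of the plan differs between databases and versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10000)],
)

lookup = "SELECT * FROM users WHERE email = ?"
args = ("user9999@example.com",)

# Without an index on email, the database has to scan the whole table.
plan_before = query_plan(conn, lookup, args)

# After adding the index, the same query can use it for the lookup.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, args)

print("before:", plan_before)
print("after: ", plan_after)
```

A few hundred rows in staging will hide this kind of problem, which is why restoring the production data first matters.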