Where security isn’t enough for Jenkins


Recently I set up a Jenkins server with login removed, and I got complaints from the higher-ups that this is the wrong way of doing things, so I had to add it back.

Unfortunately, even after I changed useSecurity to true in the Jenkins config.xml, the application still disabled login. So what I did was download a separate Jenkins on localhost to do a file comparison, and lo and behold, there were other changes needed to bring back the login.

Please ensure that your Jenkins config.xml has the following setup:


  <useSecurity>true</useSecurity>
  <authorizationStrategy class="hudson.security.FullControlOnceLoggedInAuthorizationStrategy"/>
  <securityRealm class="hudson.security.HudsonPrivateSecurityRealm"/>
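After editing config.xml, restart Jenkins so the changes take effect, since config.xml is only read at startup. On a typical Linux install that is something like:

  sudo service jenkins restart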





Creating SolrCloud part 1 – ZooKeeper ensemble


Solr is a useful full-text search tool. Think Google: it helps you easily search for the content you want. Today I want to cover how to set up the ZooKeeper ensemble.

  • Get 3 instances in AWS / GCE. For this example we use the Ubuntu OS.
  • We are not using the Ubuntu ZooKeeper package, as I found out it does not work well. Download the JDK using this command:
    1. wget http://download.oracle.com/otn-pub/java/jdk/8u91-b14/jdk-8u91-linux-x64.tar.gz
  • Go to the /etc/environment file and add the new Java bin directory:
    1. PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/jdk1.8.0_91/bin"

  • Ensure that your Java version is 1.8 or above.
  • Download ZooKeeper:
    1. wget http://download.nus.edu.sg/mirror/apache/zookeeper/stable/zookeeper-3.4.8.tar.gz
  • Untar it, and rename zoo_sample.cfg to zoo.cfg in the conf folder.
  • Copy and paste the configuration sketched after this list into zoo.cfg; the dataDir can be a directory of your choice.
  • In the dataDir, create a file named myid. Chmod it to 777. Key in a single value: 1. Why 1? Because of your zoo.cfg settings, in which you set server.1 as your first server.
  • Do the same for the other 2 servers, except that their myid files must contain 2 and 3 respectively.
  • Lastly, run bin/zkServer.sh start.
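Here is a minimal zoo.cfg sketch for a three-node ensemble; the server IPs and the dataDir below are placeholders, so substitute your own:

  # zoo.cfg for a three-node ensemble (placeholder IPs)
  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=/var/zookeeper
  clientPort=2181
  # server.<myid>=<host>:<quorum port>:<leader election port>
  server.1=10.0.0.1:2888:3888
  server.2=10.0.0.2:2888:3888
  server.3=10.0.0.3:2888:3888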

Once you run ZooKeeper, you can tell that it is running with this command: ps -ef | grep zoo. You can also check each node's role (leader or follower) with bin/zkServer.sh status.

Do note that you have to tell iptables or your security groups to keep ports 2888 and 3888 open between the servers (and 2181 for clients).

Better respect for developers


I got into a conversation that left me pretty hot under the collar. I was talking to a project manager about how he manages projects and his views on software development. He was pretty upset about the dual-track career path the company is implementing: one business track and one technical track.

I asked him why. He said the technical track is useless: no developer will stay a developer for more than 3 years. And he added that developers will always be cost centers, so adding a technical track will only increase costs for the company.

But one only needs to read Wikipedia to understand the importance of cost centers. It is NOT possible for a company to be 100% profit center and no cost center. And server admins / system architects take pride in setting up HA architecture that supports millions of users at a fraction of the cost; that takes skill, experience and expertise. Nowadays on the cloud it is so easy to create a server at the click of a button, so cost control takes on the utmost importance.

Companies that value technical talent should have a technical track and not force developers to be promoted into project managers, which ends up inflicting the Peter Principle on them. Companies should also allow developers to narrow their scope to a certain technology (e.g. JavaScript) that will help push the limits of the company's product, instead of staying full-stack developers. Roles such as Senior Developer and Senior Architect give tech guys a sense of responsibility and let them guide the younger ones.

In conclusion, I think the culture of blaming developers as the reason for the burn rate needs to change. HR and management need to work together to come up with policies that retain technical talent, so that when students from JC or secondary school choose a course, they will be more inclined to choose engineering.

When the staging environment differs from production


We often set up staging as a mirror of production, where we can do QA and load tests to ensure that the code is “production ready” before we tag it and push it out to production.

But it is often not that straightforward. There are times when the staging behaviour does not reflect the production behaviour, and we ask ourselves why. For example, certain API calls or web pages seem to take much longer to load in production than in staging, even though the app code is the same.

Often it is a case of data mismatch. You may have only a few hundred users in the staging environment while production is already in the league of millions of users. It could also be that the production data for a certain datatype (e.g. text) is excessive, or uses certain symbols or characters that hinder server performance.

So I came to a situation where I was asked to optimise an API call, and I could not reproduce the same behaviour in the staging environment. This is what I did:

  1. Grab the latest copy of the database from production and restore it into staging.
  2. Take note of the usernames of users who log in frequently in production.
  3. Use their IDs and do a load test using locust.io (you may need to make some changes to the authentication module in staging).
  4. Tweak the weightage of the API / web URLs in locust.io until your staging behaves like production, and match the RPS load as well (see the locustfile sketch after this list).
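A minimal locustfile sketch of that kind of weightage, written against the pre-1.0 locustio API that gets installed later in these posts; the URLs, weights and wait times are illustrative, not from a real app:

  from locust import HttpLocust, TaskSet, task

  class UserBehavior(TaskSet):
      @task(8)
      def hot_api(self):
          # the suspect API call gets 8x the weight of the other task
          self.client.get("/api/v1/feed")

      @task(2)
      def browse(self):
          self.client.get("/profile")

  class WebsiteUser(HttpLocust):
      task_set = UserBehavior
      min_wait = 1000  # milliseconds between tasks; tune these to match production RPS
      max_wait = 5000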

Most of the time, you don't need to match the staging and production instance types; the staging instance can be slightly weaker than production. From there, you will be able to discern how to optimise. Some common fixes are:

  1. Add missing indexes on the columns that matter
  2. Clean up the tables that log data
  3. Denormalise data where needed
  4. Place commonly used data in Memcached (see the cache-aside sketch after this list)
  5. Optimise the sections of the code that deal with string rendering
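A minimal cache-aside sketch using the python-memcached client; the key name, the query function and the TTL are illustrative assumptions:

  import memcache

  mc = memcache.Client(['127.0.0.1:11211'])

  def get_top_sellers():
      data = mc.get('top_sellers')
      if data is None:
          # hypothetical expensive query; replace with your own
          data = query_db_for_top_sellers()
          mc.set('top_sellers', data, time=300)  # cache for 5 minutes
      return data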

Allowing tty in an Amazon EC2 instance


Fabric has been quite a unique tool to me. I have encountered a few quirks that I want to share with my readers.

Users of Fabric know that in order to run an init service, you have to switch off the pty. Recently I encountered this while trying to run a service under init.d on an Amazon instance: it just would not run no matter what I did. Then I realised that AWS EC2 instances require a tty for sudo by default, so you need to switch that requirement off before you can run your Fabric commands.
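A sketch of what that can look like in a fabfile, assuming an Amazon Linux image whose /etc/sudoers carries a requiretty default (the sed pattern and the service name are assumptions):

  from fabric.api import sudo

  def start_my_service():
      # comment out the "Defaults requiretty" line so sudo no longer demands a tty
      sudo("sed -i 's/^Defaults.*requiretty/#&/' /etc/sudoers")
      # now the init.d service can be started with the pty switched off
      sudo('service myservice start', pty=False)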

Here is my Stack Overflow question, where I ownself answer ownself. (Only Singaporeans know this :-p)

Fdisk in Fabric


One of the issues in Fabric is that every command, sudo() or run(), is equivalent to a new session. Hence it is difficult to run an fdisk command, as fdisk requires interaction.

In a bash script, you overcome this problem by using a pipe: you put the inputs that you want before the | symbol so that bash knows what keystrokes to feed in.

So how do you translate that into a Fabric command? I explored for months, tinkering with different ways of running fdisk. Finally I managed to come up with a working command, as shown below:

# feed fdisk its keystrokes through a pipe: "g" creates a new GPT
# partition table and "w" writes the change to disk; the doubled
# backslashes keep the \n literal so the remote printf interprets it
sudo('printf "g\\nw\\n" | fdisk /dev/xvdb')

Watch out for Fabric pitfalls


Fabric has been a very useful tool. It helps a lot in simplifying and documenting server deployments.

Recently I was tasked to convert a bash web server setup script into a Fabric deployment file. Due to some unfamiliarity with Fabric, I thought that converting every single line of the bash script into run('<command line from bash>') would work. Man, I was proven wrong again and again.

So I am blogging some of the pitfalls I have fallen into so that you don’t repeat my mistakes:

Bash Script: cd /opt/mono

Fabric: run('cd /opt/mono') WRONG

run() does not accept cd this way. Each run() call is a fresh session, so the cd does not carry over to the next command. You need the cd context manager from fabric.context_managers instead. The correct way is:

with cd('/opt/mono'):
    run('git clone http://github.com/mono.git')

Bash Script: export CONFIG_PATH='/usr/lib/config'

Fabric: run("export CONFIG_PATH='/usr/lib/config'") WRONG

You cannot use the run command in Fabric to set environment variables, for the same new-session reason. You need the shell_env context manager, also from fabric.context_managers:

with shell_env(CONFIG_PATH='/usr/lib/config'):
    run('echo $CONFIG_PATH')  # commands inside the block see the variable


Finally, when you put() files, it is advisable to call put(local_filename, remote_filename_or_folder). I have experienced that if you pass both parameters as folders, the files you transfer end up as 0 bytes. See the example below.
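For instance, naming the file explicitly on both sides (the paths here are illustrative):

  from fabric.api import put

  # file -> file, not folder -> folder, to avoid 0-byte transfers
  put('config/app.conf', '/usr/lib/config/app.conf')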

Load test 101


Load testing can be a specialist trade. There is a wealth of knowledge and tools out there in the market, and it is a very important part of making sure your architecture is resilient in the face of a huge load of request calls.

So, where do we start? Basically, load tests are separated into 2 different categories:

Car Squeeze

In a car squeeze load test, you try to cram as many users into the system as possible, slowly, one at a time, until you get a response time so bad that the app/web is unusable. In the car squeeze way of load testing, you set your number of users as high as possible but the user spawn rate as low as possible.

Something like this:

[Screenshot: locust settings with a very high number of users and a very low hatch rate]

With this type of load test, you will know, for a certain instance type, the maximum possible number of users a cloud instance can hold before the response time goes sky-high. So if the marketing folks tell you to expect a certain number of users, you just divide their number by your per-instance limit, and you roughly have the number of instances the system should theoretically need to hold the load.

Great Sale Rush

Imagine the Great Singapore Sale with many shoppers camping for the iPhone 7. Once the time is reached, the doors open and all the mad shoppers flood in. Now imagine a load test where you try to squeeze as many requests through the narrow load balancer door as possible.

Such an extreme squeeze can result in unforeseen consequences, like a complete collapse of the servers under the sudden huge load. Dealing with this scenario is a constant concern for system architects. One way is to throttle users so that the incoming load gets evened out over time. And if you want to test it on locust, you should try the setting below. Not for the shallow pockets, though: you need a dozen CPU-intensive instances to pull this off.

[Screenshot: locust settings with both the number of users and the hatch rate set very high]

Do remember to activate your autoscaling, so as to check whether it can spawn instances fast enough to cope with the sudden surge of users wanting to come in. It could be that the instances take too long to spawn, causing a performance loss, or that the instances spin up fast enough but the app within each instance takes a long time to warm up.

Do use Monit to ensure the uptime of services (e.g. nginx), so that the moment a daemon decides to take a break, Monit will wake it up.

So what else do I have to look at in load testing?

Well, users can be heavily biased towards certain pages that have attractive promotions, or a certain API call can be unusually frequent for a long period of time (like rankings in gaming). Hence you need to run through scenarios where, for instance, a certain page or API call has a high probability of 0.8, which is a very high frequency; the @task weights in the locustfile sketch earlier are one way to express this. When you run those tests, check the cache (Memcached, Redis) for any hot key issues, and also check the database for any performance hits.

Load, Loader, Locust


I recently tried out a new load testing tool called locust.io. Amazing tool. It gives you great freedom to customise your load test. It also allows a master-slave cluster setup that gives you the power to blast 100K DAU/HAU types of load if you want.

A big thanks to Alvin Yeoh, a fellow colleague who helped me with the setup

Below are the steps for setting up a master-slave locust.io cluster:

  1. Set up two instances, one as the master and the other as a slave.
  2. Install the following packages on both instances:
    1. sudo yum install python27-devel.x86_64
    2. sudo yum install gcc
    3. sudo pip install locustio
    4. sudo yum install gcc-c++.noarch
    5. sudo yum install git
    6. sudo yum install mysql-devel
    7. sudo pip install MySQL-python

On your slave instance, run this command:

locust --slave --master-host=MASTER_IP --host=TARGET_URL

On the master instance, run this command:

locust --master
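Both commands assume the same locustfile is present on the master and on every slave; you can point to it explicitly with the -f flag. A sketch, with locustfile.py standing in for whatever test file you wrote:

  locust --master -f locustfile.py
  locust --slave --master-host=MASTER_IP -f locustfile.py --host=TARGET_URL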

Dealing with hot keys


Memcached keys can get hot. I mean real hot, to the point where they can hinder your server performance. Below is one of the screenshots of an app in New Relic.

[Screenshot: New Relic graph of the affected app]

Of course, the application ends up in flames as the key gets too hot to handle.


So how does one solve the issue of hot keys?

Identify the hot keys using MCTOP

MCTOP (memcached top) is a Ruby-based gem that helps a sysadmin identify the hot keys. Usually hot keys are keys that are large in size, so filter the list by key size and then identify the largest keys available.

Move the data to a CDN

If the size is above 1MB (which is also memcached's default item size limit) and the data doesn't change frequently (aka master data), then you can consider putting it on a CDN (Content Delivery Network). A CDN is a cost-effective way to quickly bring content to the masses, removing one huge bottleneck from the system.

Have duplicates of the key in a cache farm

Another way is to use McRouter or Twemproxy to help you manage the hot key. They can run multiple cache servers with duplicated cache data. Remember, replication is the key to high availability.

Break down the data into multiple smaller key-values

Instead of doing a select * and putting the whole table into one key, one can consider a <table>_<id> key per row, with the row as the value. It helps because the size of each value is much smaller. A sketch of the idea follows.
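A minimal sketch of this key breakdown with the python-memcached client; the table name, helper names and TTL are illustrative:

  import memcache

  mc = memcache.Client(['127.0.0.1:11211'])

  def cache_row(table, row_id, row, ttl=3600):
      # one key per row, e.g. 'products_42', instead of one giant 'products' key
      mc.set('%s_%s' % (table, row_id), row, time=ttl)

  def get_row(table, row_id):
      return mc.get('%s_%s' % (table, row_id))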

Use replication groups in AWS ElastiCache

If you use AWS ElastiCache, replication groups help reduce the bandwidth bottleneck associated with hot keys, as the same request need not always go to one node but can be served by multiple nodes.