I have seen all sorts of deployments, from the very worst to the best. Many of the poor ones suffer from database scalability issues that were never factored into the architecture at the start. When a sudden surge of users arrives, the database cannot scale, and the application as a whole suffers.
This is why database sharding is important. Why would you want to shard your database?
- You don’t want downtime, or you want downtime to be as minimal as possible.
- Smaller chunks of data are easier to back up or replicate.
- Queries can be faster, as the query load is divided across different hosts.
Question: what if your old application uses MySQL, and you or the management do not want to enlist the help of dbShards to manage your data? And you think that MongoDB is unstable (yes, don’t laugh, I have met such people before). You will have to do it on your own.
Here I list some deployments that I have seen. You can decide for yourself which suits you best.
1) Table sharding within the database
Say you have a table called users, and its row count grows exponentially. What I saw some legacy systems do is create multiple tables with a _&lt;number&gt; appended to the table name, like users_1, users_2 and so on. So if you have 300K users, you break them up into 100K users per table.
I strongly do not recommend this approach. The only advantage I see is that you can easily archive old data by dumping the oldest table. Other than that, the downsides are many. You have to modify your queries based on the primary key (if the id is 100001, it falls outside the first 100K block, so you have to query users_2 instead of simply querying users). And since everything still sits on one host, it does nothing to divide the query load on the database host.
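To see what this approach forces on your data layer, here is a minimal sketch of the table-routing logic, assuming a hypothetical split of 100K rows per table (the function name and sizes are illustrative):

```python
def users_table_for(user_id: int, rows_per_table: int = 100_000) -> str:
    """Map a primary key to the users_<n> table that holds it."""
    shard = (user_id - 1) // rows_per_table + 1
    return f"users_{shard}"

# Every query that used to hit `users` must now be rewritten per id:
print(users_table_for(99_999))   # users_1
print(users_table_for(100_001))  # users_2
```

Every single query, join, and report that touches users now has to go through a helper like this, which is exactly the maintenance burden described above.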
2) Different database on different hosts containing different tables.
Say you have db1 containing the users table on host 10.0.0.1, plus user-related tables with the foreign key user_id, and db2 containing the characters table on host 10.0.0.2. Different hosts host different databases, which in turn host different groups of tables.
I don’t recommend this setup either. The reason is that you cannot do JOIN queries across tables located on different hosts. You end up querying one database to get the data, then using the ids to query for data on another database, and chances are you will have foreach loops everywhere. It will stress your application server.
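To make that downside concrete, here is a sketch of the application-side "join" this setup forces. The connection objects and table names are hypothetical stand-ins for real MySQL connections to each host:

```python
def fetch_users_with_characters(db1, db2):
    """Emulate a JOIN between hosts: one query to db1 for users, then one
    query to db2 per user (the classic N+1 pattern a real JOIN avoids)."""
    users = db1.query("SELECT id, name FROM users")
    result = []
    for user in users:
        # One extra network round trip per user, in application code.
        characters = db2.query(
            "SELECT * FROM characters WHERE user_id = %s", (user["id"],)
        )
        result.append({**user, "characters": characters})
    return result
```

With a single host you would write one JOIN and let MySQL do this work; here the loop, the round trips, and the memory for intermediate results all land on your application server.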
3) Multiple hosts with the same database schema
Say you have 2 hosts, 10.0.0.1 and 10.0.0.2, and both have a database called db. The names do not clash as they are hosted on different servers. Both have users and characters tables, but users is sharded (user id 1001 on one host, user id 2001 on the other), while characters holds master data that is identical on both databases.
I would recommend this setup. You do not need to change your queries based on the id, as your table names and db names are the same everywhere, and you can do cross-table JOIN queries within a shard. You just need to overcome a few challenges with this setup:
i) You need to classify which tables are sharded tables and which are master tables. The usual rule of thumb is that tables that constantly receive new records are sharded tables, like users when people keep signing up. From there, tables related to that table, which contain its primary key as a foreign key, need to be sharded as well. As for global tables, you need to make sure the data is the same on all shards so that cross-table queries remain possible.
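One way to make that classification explicit is a small registry that the data layer consults before routing a query. All table names here are hypothetical examples:

```python
# Tables keyed (directly or via a user_id foreign key) on the sharded
# entity: rows live on exactly one shard.
SHARDED_TABLES = {"users", "user_items", "user_friends"}

# Master/global tables: replicated identically to every shard so that
# in-shard JOINs against them keep working.
MASTER_TABLES = {"characters", "game_config"}

def is_sharded(table: str) -> bool:
    """Tell the data layer whether a table must be routed by shard key."""
    return table in SHARDED_TABLES
```

Writes to anything in MASTER_TABLES then have to be applied to every shard (or replicated there), which is the price of keeping in-shard joins possible.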
ii) You need to determine a mechanism for sharding the data, which means autoincrement is a no-no for the tables you want to shard. You may want logic where the first digit of the user id determines which shard the row is in, so id 1001 is in the first shard and 2001 is in the second shard. You may also want to think about how to do round-robin sharding to make sure that the amount of data in both database shards grows evenly and one shard does not outgrow the other.
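The two ideas above can be sketched together: a hypothetical id scheme where the leading digit encodes the shard, plus a round-robin generator that replaces AUTO_INCREMENT so both shards grow evenly. The hosts, id layout, and class names are assumptions for illustration:

```python
SHARD_HOSTS = {1: "10.0.0.1", 2: "10.0.0.2"}

def host_for_user(user_id: int) -> str:
    """The first digit of the id picks the shard: 1001 -> shard 1, 2001 -> shard 2."""
    return SHARD_HOSTS[int(str(user_id)[0])]

class RoundRobinIdGenerator:
    """Hand out ids alternating between shards so neither outgrows the
    other. This replaces AUTO_INCREMENT, which cannot span hosts."""

    def __init__(self, shards=(1, 2), start=1000):
        self._shards = shards
        self._seq = {s: start for s in shards}  # per-shard sequence counter
        self._turn = 0

    def next_id(self) -> int:
        shard = self._shards[self._turn % len(self._shards)]
        self._turn += 1
        self._seq[shard] += 1
        # Prefix the per-shard sequence with the shard digit so that
        # host_for_user() can route by the id alone.
        return int(f"{shard}{self._seq[shard]}")
```

In production you would persist the sequence counters (or use a ticket server) so restarts do not reissue ids, but the routing principle is the same.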