If I should have made some safe bets on the near future, I would choose two: Hadoop and MongoDB.
MySQL Sharding was a major issue for large scale installations and it is the same for mongoDB large installations.
Back to Basics
Mongo is pretty similar to a regular database, but it has two main advantages: 1) Software engineers love it as it can easily be used for object persist-ency and 2) it support unstructured objects (documents) that can easily store different objects based on the same virtual class.
- Database: database
- Collections: very similar to tables.
- Documents: very similar to rows. Yet, a document can be as flexible as a JSON document can be. For example, it may include 1 to many fields in the document itself.
- mongod: a mongoDB instance or shard.
- Chunk: a 64MB storage unit that stores documents.
- Config database: Chunks to mongos mapping directory.
- Support large dataset using commodity servers.
- Support high IO requirements using commodity disks.
- Range-based Data Partitioning: a very similar method to MySQL partitioning. You should choose one or more fields (shard key) that sharding will be based on. You should choose a shard key according to the business logic, like splitting according to account id in a SaaS application.
- Automatic Data Volume Distribution: mongoDB will take care of the shards balancing by itself according to the chosen shard key.
- Transparent Query Routing: mongoDB takes care of queries map reduce to multiple shared by itself when a query does not match the shard key (very much like Hadoop).
- Sufficient Carnality: choose a shard key that can be split later to more shards if a database size is getting too large (exceeds chunk size).
- Uniform Distribution: choose a sharding key that will spread a in uniform distribution to avoid unbalanced design.
- Distribute Write Operations: if you have a billing system, prefer to shard according to account id rather than shard according to billing month. Otherwise, in a given day, probably only a single shard will be used.
- Query according to the shard key: if any of your queries will include the shard key, each of your queries will result in a single shard query. Otherwise, it will generate N queries (one per shard).
- Every sharded collection must have an index that its first fields are the shard key (use shardCollection for that).
- Chunk size default limit is 64MB
- When a chunk reaches this limit, mongoDB will split it to two.
- If chunks are not distributed uniformly, mongoDB will start migrating chunks between different mongos.
- Cluster Balancer is taking care of this process.
- Balancing can cause performance issues and therefore can be restricted to off peak hours (nights and weekends for example) using balancing windows.
- The shards mapping to mongos is saved at the config database.
- Replication should be considered as well a complementary method.