The first thing you need to know is that this is not a MySQL problem. It is not necessarily a problem with your MySQL configuration, queries, or hardware either, even though fixing these does help in many cases. However powerful and well tuned your system is, if you put too heavy a concurrent load on it, response times will increase and user experience will suffer.
So what can you do to prevent this problem from happening? The answer is easy: throttle the side load so it does not consume too many system resources. Here are some specific techniques to use.
Do not push concurrency too high:
Many developers test scripts at multiple levels of concurrency and find that doing the work from 32 processes is faster than doing it from just one. This is true if you have the system completely at your disposal. However, if the system needs to serve other users too, you typically need to reduce concurrency to a level that does not overload it. Unless it is a really time-critical process, I would not use more than 4 parallel processes heavily writing to the database.
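Capping the number of writers can be sketched in a few lines of Python. This is only an illustration under assumptions: the name run_throttled and the idea of passing each unit of work as a zero-argument callable are mine, and the limit of 4 simply mirrors the rule of thumb above rather than being a tuned value.

```python
from concurrent.futures import ThreadPoolExecutor

# Rule-of-thumb cap from the text; tune it for your own system.
MAX_WRITERS = 4


def run_throttled(jobs, max_workers=MAX_WRITERS):
    """Run zero-argument callables with at most `max_workers` in flight.

    Results are returned in the same order as `jobs`. Each job is expected
    to open and use its own database connection.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(job) for job in jobs]
        # result() re-raises any exception the job raised.
        return [future.result() for future in futures]
```

However many jobs you queue, no more than max_workers of them touch the database at the same time; the rest wait their turn.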
Sometimes even a single process overloads the system too much. In this case, throttling by running relatively short queries and introducing "sleeps" between them can be a good idea. It also often helps to avoid monopolizing replication threads. For example, if I need to delete old data, instead of DELETE FROM TBL WHERE ts<"2010-01-01" I will run DELETE FROM TBL WHERE ts<"2010-01-01" LIMIT 1000 in a loop until no more rows need to be deleted. I may inject a "sleep" between iterations that is as long as the query execution time, so the longer the queries run and the more the system is loaded, the more "rest" it will get. Alternatively, you can look at the Threads_running status variable, which is a very good, simple indicator of the current load, and sleep based on its value. For example, you may choose to pause the script if the load is too high and wait for Threads_running to go below a certain value.
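The chunked delete with a proportional sleep might look like the sketch below. It assumes a DB-API 2.0 connection to MySQL (for example one created with PyMySQL) whose cursors return plain tuples and use the %s parameter style. The function names and the max_running parameter are hypothetical; only the table TBL, the column ts, and the batch size of 1000 come from the example above.

```python
import time


def wait_until_quiet(cur, max_running, sleep=time.sleep, poll_seconds=1.0):
    """Block until Threads_running is at or below max_running; return it."""
    while True:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_running'")
        running = int(cur.fetchone()[1])  # row is (Variable_name, Value)
        if running <= max_running:
            return running
        sleep(poll_seconds)


def delete_in_chunks(conn, cutoff="2010-01-01", chunk=1000, max_running=None,
                     sleep=time.sleep, clock=time.monotonic):
    """Delete rows older than `cutoff` in batches; return the total deleted.

    After each full batch the script rests for as long as the batch took,
    so a loaded server automatically gets longer pauses.
    """
    cur = conn.cursor()
    total = 0
    while True:
        if max_running is not None:
            # Optional: hold off while the server is busy.
            wait_until_quiet(cur, max_running, sleep)
        started = clock()
        cur.execute(
            f"DELETE FROM TBL WHERE ts < %s LIMIT {int(chunk)}", (cutoff,)
        )
        conn.commit()
        deleted = cur.rowcount
        total += deleted
        if deleted < chunk:  # a short batch means nothing is left
            return total
        sleep(clock() - started)
```

Note that the script's own connection counts toward Threads_running while it runs SHOW GLOBAL STATUS, so a threshold of 0 would never be reached.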
It also often helps to look into cron or whatever other scheduling system you are using. Frequently, way too many scripts are started at once, or so close to each other that they overlap and produce the overload. Solutions include spacing them out and introducing some "job control" to ensure scripts do not run in parallel when they should not (so you don't get many copies of the same script running at once). One simple solution: instead of having a bunch of scripts scheduled to start at midnight, 1AM, and 2AM, I can put them into nightly.sh one after another and schedule that to run at midnight. This way the scripts run one after another at their own pace.
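Here is a rough sketch of that kind of job control: a wrapper that runs the nightly scripts strictly one after another and refuses to start if a previous copy is still running. It assumes a Unix system (it relies on flock), and the function name run_nightly and the default lock path are made up for illustration.

```python
import fcntl
import subprocess


def run_nightly(commands, lock_path="/tmp/nightly.lock"):
    """Run commands sequentially unless another copy holds the lock.

    `commands` is a list of argument lists, e.g. [["/path/to/job1.sh"], ...].
    Returns the list of exit codes, or None if another copy is running.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # Each script starts only after the previous one has finished.
        # The lock is released when the file is closed on exit.
        return [subprocess.run(cmd).returncode for cmd in commands]
```

Scheduling a single wrapper like this at midnight gives the same effect as nightly.sh, with the added guarantee that a slow run from the previous night does not overlap with the next one.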
I remember listening to a talk by Cary Millsap in which he recommended moving the load in time and space as an optimization technique. We spoke about moving load in time above, but we can also move it in space by putting it on a different system, which in the MySQL world is most commonly a dedicated slave. In a lot of environments, especially those with too little operational or development discipline to enforce the previous solutions, it can be a life saver. Of course, it only works for read jobs, which is an important limitation. Getting slave(s) for batch jobs can help in other ways too: for example, competition for the buffer pool between different kinds of workloads is reduced.
Surprisingly simple but effective: setting innodb_old_blocks_time=1000 can often be very helpful in preventing batch jobs from washing away the buffer pool contents, which would make normal user queries a lot more disk bound and slower. I wrote about it in more detail a few months ago.
Finally, let's touch upon the discovery question. To deal with load management you need to understand when the problem is happening in your environment (we want to catch it before users complain, right?) and, when it does, exactly which jobs cause the overload. In complex environments this can be a hard question. pt-stalk is a great tool for this purpose. Getting it running can help you collect the state of your system at the moment it was overloaded with side load. The wealth of data it generates will most likely contain the answers you're looking for.