Checking on your processes is critical. Yes, we already have tons of Linux tools and tips available. Getting familiar with your weapons is the first step and the easiest part.
More important is which questions you ask, and why, when you approach a critical process. Fortunately, with common sense we can dig out lots of valuable information.
Assumptions Before Deep Dive
Here we assume you are familiar with:
- FD (file descriptor): Everything in Linux is a file.
- /proc pseudo filesystem: How the Linux kernel exposes in-depth information about processes.
- lsof, top, ps, grep: First time hearing of them? Excuse me?
Basic Check For Linux Process
When the Process Is Started and How Long It Runs
This helps us detect whether an unexpected or suspicious service restart has happened. As a supplement, a decent service will always log properly, which can confirm our observation.
# Get elapsed running time by pid
ps -eo pid,comm,etime,user | grep $pid

# Sample output:
root@s1:~# ps -eo pid,comm,etime,user \
    | grep 20513
20513 dockerd 8-00:58:30 root
# It means 8 days, 0 hours, 58 min and 30 sec
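The etime column is easy to read but awkward to compare in a script. Here is a rough sketch that converts it to seconds, assuming the `[[dd-]hh:]mm:ss` layout described in the ps man page. The function name `etime_to_seconds` is made up for this example.

```shell
# etime_to_seconds ETIME
# Convert a ps "etime" value ([[dd-]hh:]mm:ss) into seconds.
# Hypothetical helper, not part of ps.
etime_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }  # dd-hh:mm:ss
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }               # hh:mm:ss
        NF == 2 { print $1 * 60 + $2; next }                           # mm:ss
        { exit 1 }                                                     # unknown layout
    '
}
```

With the sample above, `etime_to_seconds 8-00:58:30` prints 694710. Against a live process, something like `etime_to_seconds "$(ps -o etime= -p $pid | tr -d ' ')"` should work.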
Where Is the Log File?
A very common question, especially from Dev or QA. Usually, a process logs continuously, so it holds the fds of its log files. lsof can list all fds opened by the process. So you don't need to ask anyone to find out the answer!
# Find out log files by pid
lsof -P -n -p $pid | grep ".*log$"

# Sample output:
# root@s1/# lsof -p 40 | grep ".*log$"
# daemon .. /var/log/jenkins/jenkins.log
# daemon .. /var/log/jenkins/jenkins.log

# Check log files for error/exceptions
grep -C 3 -iE "exception|error" $logfile
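If lsof is not installed on the box, the same answer can usually be read straight from /proc. The sketch below walks the fd symlinks and keeps targets ending in .log. `list_open_logs` is a made-up name, and the optional second argument exists only so the function can be pointed at a fake directory for testing.

```shell
# list_open_logs PID [PROC_ROOT]
# Print unique *.log paths that the process currently holds open.
# PROC_ROOT defaults to /proc (hypothetical parameter, for testing only).
list_open_logs() {
    for fd in "${2:-/proc}/$1/fd"/*; do
        # Each entry is a symlink to the opened file
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *.log) printf '%s\n' "$target" ;;
        esac
    done | sort -u
}
```

Usage would be `list_open_logs $pid`, typically run as root or as the owner of the process.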
How Much CPU and Memory the Process Takes
We certainly need to stay on top of any abnormal resource utilization. Fortunately, almost all modern monitoring systems let us see the history, a big plus for troubleshooting.
# Check process resource utilization
top -p $pid
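top is interactive, which is not great for scripts or cron jobs. As a sketch, resident memory can be read from the VmRSS line of /proc/$pid/status. `rss_kb` is a made-up name, and the second argument is a testing hook, not something /proc requires.

```shell
# rss_kb PID [PROC_ROOT]
# Print the resident set size in kB, taken from the VmRSS line.
# Returns non-zero if the line is missing (e.g. kernel threads).
rss_kb() {
    awk '$1 == "VmRSS:" { print $2; found = 1 }
         END { exit !found }' "${2:-/proc}/$1/status"
}
```

For a one-shot snapshot that includes CPU as well, `ps -o pid,%cpu,%mem,rss,comm -p $pid` is another option.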
What's the Command Line Starting the Process?
People ask this question when they're required to manage unfamiliar or uncomfortable services. A more urgent case: the stupid service just mysteriously refuses to start. Wrong Java opts? File permission issue? The process command line can give us some insight or hints.
# Find out process start command line
# (arguments in cmdline are NUL-separated, so turn NULs into spaces)
tr '\0' ' ' < /proc/$pid/cmdline; echo
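Long Java command lines are easier to inspect with one argument per line. A minimal sketch, assuming the usual NUL-separated layout of /proc/$pid/cmdline. The function name and the optional root argument are invented for this example.

```shell
# print_cmdline PID [PROC_ROOT]
# Print each command-line argument of the process on its own line.
print_cmdline() {
    # Arguments are separated by NUL bytes; turn them into newlines
    tr '\0' '\n' < "${2:-/proc}/$1/cmdline"
}
```

Then `print_cmdline $pid | grep -i xmx` quickly answers the "wrong Java opts?" question.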
What TCP Ports Is the Process Listening On?
Nowadays the majority of services are web-based or microservices. It helps to know which TCP ports the process is listening on.
# Check what ports are serving
lsof -P -n -p $pid | grep -i listen

# Check whether given port is listening
lsof -i tcp:$tcp_port
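When you only want the port numbers, e.g. to feed a health check, the lsof output can be trimmed down. The filter below is a sketch that assumes the usual `lsof -P -n` layout, where the address comes right before a trailing `(LISTEN)`. `listen_ports` is a made-up name.

```shell
# listen_ports
# Read `lsof -P -n` output on stdin and print the unique listening ports.
listen_ports() {
    awk '$NF == "(LISTEN)" {
        n = split($(NF - 1), parts, ":")   # the port follows the last colon
        print parts[n]
    }' | sort -n -u
}
```

Usage: `lsof -P -n -p $pid | listen_ports`.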
How Many FDs the Process Has Open
Too many open fds, say over 3000, is usually a bad sign. Common causes: a bad design that makes the application inefficient at handling requests; an fd leak; or more requests than we expected.
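# Get total fd count opened by pid
lsof -p $pid | wc -l
# Note: lsof also prints a header line and non-fd entries (cwd, txt, mem).
# For the exact number of fds: ls /proc/$pid/fd | wc -l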
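A raw count means more next to the limit the process actually runs with. As a sketch, this compares the number of entries in /proc/$pid/fd against the soft "Max open files" value in /proc/$pid/limits. `fd_usage` and its optional root argument are invented for this example.

```shell
# fd_usage PID [PROC_ROOT]
# Print "open/soft_limit" for the process, e.g. "3/1024".
fd_usage() {
    root=${2:-/proc}
    # One entry per open file descriptor
    open=$(ls "$root/$1/fd" | wc -l | tr -d ' ')
    # Line layout: "Max open files  <soft>  <hard>  files"
    limit=$(awk '/^Max open files/ { print $4 }' "$root/$1/limits")
    printf '%s/%s\n' "$open" "$limit"
}
```

If the first number creeps toward the second, it is time to look at either the limit or the application.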
Advanced Check For Linux Process
Check How Resident Memory is Used by the Process
This is especially important when the process is taking way too much memory. pmap reports the memory map of a process.
# Display detail memory usage
pmap -x $pid
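pmap output gets long for big processes. Here is a sketch that keeps only the largest mappings by RSS, assuming the `pmap -x` column order Address, Kbytes, RSS, Dirty, Mode, Mapping. `top_rss` is a made-up name.

```shell
# top_rss [N]
# Read `pmap -x` output on stdin and print the N largest mappings
# as "RSS_kB mapping" (N defaults to 5).
top_rss() {
    awk '$1 ~ /^[0-9a-f]+$/ && $3 ~ /^[0-9]+$/ {
        # The mapping name may contain spaces, e.g. "[ anon ]"
        name = $6
        for (i = 7; i <= NF; i++) name = name " " $i
        print $3, name
    }' | sort -k1,1nr | head -n "${1:-5}"
}
```

Usage: `pmap -x $pid | top_rss 10`.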
Find Out Process Tree
For a multi-threaded process, displaying all threads and their starting commands might be helpful. It gives us very good insight.
# Get all threads for a given process
pstree -A -a -p $pid

# keep checking process tree
watch "pstree -A -a -p $pid"
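If pstree is missing, thread ids and names can also be read from /proc/$pid/task. A minimal sketch; `list_threads` and the optional root argument are made up for this example.

```shell
# list_threads PID [PROC_ROOT]
# Print "tid name" for every thread of the process.
list_threads() {
    for t in "${2:-/proc}/$1/task"/*; do
        # Each thread has its own directory with a comm file holding its name
        if [ -r "$t/comm" ]; then
            printf '%s %s\n' "${t##*/}" "$(cat "$t/comm")"
        fi
    done
}
```

`list_threads $pid | wc -l` then gives the thread count.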
Detect Long TCP Connections and How Long They Have Been Running
Watch out for long-lived TCP connections. A daemon service might not only take requests, but also initiate connections. Developers may keep long-lived TCP connections from applications to DB services. When app nodes and DB nodes are disconnected, or DB instances are restarted, will your process survive the chaos and keep working?
# List TCP connections it starts
lsof -p $pid | grep ESTABLISHED

# Check create/update time for given fd
stat /proc/$pid/fd/$fd_num

# Sample:
# root@s1:~# date
# Fri Sep 23 23:22:22 EDT 2016
#
# root@s1:# lsof -p 265 |grep ESTABLISHED
# 134u . 47..33 s1:59427->s2:9300 (EST..
# 140u . 47..10 s1:38078->s2:9300 (EST..
# 142u . 47..11 s1:38079->s2:9300 (EST..
# 143u . 47..81 s1:51033->s2:9300 (EST..
#
# root@s1:~# stat /proc/265/fd/134
# File: /proc/265/fd/134->socket:[47..
# Size: 64 Blocks: 0 ..
# Device: 3h/3d Inode: 463..8 Links:..
# Access: (0700/lrwx------) Uid: (0/..
# Access: 2016-09-23 19:50:12... -0400
# Modify: 2016-09-05 19:48:05... -0400
# Change: 2016-09-05 19:48:05... -0400
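Doing the date math by hand gets old fast. The sketch below turns the Modify time of an fd entry into an age in seconds. It assumes, as the sample above does, that this timestamp approximates when the fd was opened. procfs does not guarantee that, so treat the number as a hint rather than a measurement. `fd_age_seconds` and the optional root argument are made up, and `stat -c` assumes GNU coreutils.

```shell
# fd_age_seconds PID FD [PROC_ROOT]
# Print how many seconds ago the fd entry was last modified.
fd_age_seconds() {
    link="${3:-/proc}/$1/fd/$2"
    # Without -L, stat reports on the fd symlink itself (epoch seconds)
    modified=$(stat -c %Y "$link") || return 1
    echo $(( $(date +%s) - modified ))
}
```

For the sample above, that would be `fd_age_seconds 265 134`.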
How to Detect an FD Leak
If an application keeps opening files or sockets without gracefully closing them, it has an fd leak. Its fd count will keep rising, and eventually the process will hit its open-file limit and fail or crash. Usually, this happens in problematic error-handling logic.
# Get total fd count by pid
lsof -p $pid | wc -l
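A single count says little; the trend is what gives a leak away. Here is a sketch that samples the fd count a few times so the numbers can be compared or graphed. `watch_fd_count` and its arguments are invented for this example.

```shell
# watch_fd_count PID SAMPLES INTERVAL_SECONDS [PROC_ROOT]
# Print "epoch_seconds fd_count" once per sample.
watch_fd_count() {
    i=0
    while [ "$i" -lt "$2" ]; do
        count=$(ls "${4:-/proc}/$1/fd" | wc -l | tr -d ' ')
        printf '%s %s\n' "$(date +%s)" "$count"
        i=$((i + 1))
        # Sleep between samples, but not after the last one
        if [ "$i" -lt "$2" ]; then
            sleep "$3"
        fi
    done
}
```

For example, `watch_fd_count $pid 60 10` takes one sample every 10 seconds for about 10 minutes. A count that only ever goes up under steady load is a strong hint.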
What Files Are Being Downloaded and What Is the Progress?
The application might be stuck on a heavy internet request, e.g. downloading huge files. To dig out the details, find the fd, which should be a regular file opened in write mode. Then keep polling the file size to see how far along we are.
# Get REG(regular) fd with write mode
lsof -p $pid | grep REG | grep "w "

# Check file size (-L follows the fd symlink to the real file)
watch "ls -lthL /proc/$pid/fd/$fd_num"
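If you know the expected size of the download, the polling can print a percentage instead. A rough sketch, assuming GNU stat. Both function names and the optional root argument are made up.

```shell
# fd_file_size PID FD [PROC_ROOT]
# Print the size in bytes of the file behind the fd.
fd_file_size() {
    # -L dereferences the fd symlink so we get the real file size
    stat -L -c %s "${3:-/proc}/$1/fd/$2"
}

# progress_percent CURRENT_BYTES TOTAL_BYTES
# Print the integer percentage done (rounded down).
progress_percent() {
    echo $(( $1 * 100 / $2 ))
}
```

Usage: `progress_percent "$(fd_file_size $pid $fd_num)" $total_bytes`, where `$total_bytes` is whatever size you expect the finished file to have.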
Check for Files Deleted but Not Gracefully Closed
When files are removed somehow, your process might still hold the stale fd, or even try to read or write the file. This should definitely be avoided, and developers should be alerted.
# List unexpected file deletion
lsof -p $pid | grep deleted
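The same check works without lsof, because the kernel appends " (deleted)" to the link target of such fds in /proc. A minimal sketch; `deleted_open_files` and the optional root argument are made up for this example.

```shell
# deleted_open_files PID [PROC_ROOT]
# Print "fd original_path" for every open file that has been deleted.
deleted_open_files() {
    for fd in "${2:-/proc}/$1/fd"/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *" (deleted)")
                # Strip the marker to show the original path
                printf '%s %s\n' "${fd##*/}" "${target% (deleted)}" ;;
        esac
    done
}
```

As long as the process keeps the fd open, the content is still reachable, so something like `cp /proc/$pid/fd/$fd_num /tmp/recovered` can rescue a log file that was removed by accident.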