I see a new input in SoS 3.0:
NEW DATA INPUT! - Scripted input 'nfs-iostat_sos.py' is now available to monitor the I/O usage of pooled search-heads on the shared NFS device.
I don't see any dashboards that use the data collected by the script. Any plans for this?
Absolutely! In a future release (hopefully, the next one), we plan to ship a view using the events generated by this scripted input to show the I/O bandwidth usage and the performance and responsiveness of the shared NFS device in a search-head pool.
In the meantime, you can always run manual searches against the data collected by nfs-iostat_sos.py to that end.
Here are a couple of simple examples:
Median IOPS and total OP count per OP type for the past 5 minutes:
index=sos source="nfs-iostat_sos.py" earliest=-5m
| stats sum(op_count) median(ops_per_sec) by op_type
Worst-case round-trip time (RTT) for GETATTR, LOOKUP, and ACCESS calls:
index=sos source="nfs-iostat_sos.py" (op_type=GETATTR OR op_type=LOOKUP OR op_type=ACCESS)
| timechart max(rtt_per_op) by op_type
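A third search along the same lines, offered as a sketch rather than something shipped with SoS: it computes each op type's share of all operations over the past 5 minutes, which makes it easier to see when one op type dominates the totals. It relies only on the op_count and op_type fields used above; total_ops, all_ops, and pct_of_ops are names introduced here for illustration.

```
index=sos source="nfs-iostat_sos.py" earliest=-5m
| stats sum(op_count) AS total_ops by op_type
| eventstats sum(total_ops) AS all_ops
| eval pct_of_ops=round(100 * total_ops / all_ops, 1)
| sort - total_ops
| fields op_type total_ops pct_of_ops
```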
Or I guess my real question is: what is the best way to track this over a long period of time (like 90 days), so that we can determine whether things today are similar to how they were 90 days ago? Something like:
index=sos source="nfs-iostat_sos.py" op_type=getattr earliest=-90d | timechart span=1d avg(kBps) by host
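One possible way to sketch a then-versus-now comparison is to search two time windows in a single query and label each event by period. This assumes the sos index retains at least 97 days of data, and it uses the rtt_per_op field from the example searches rather than kBps. The period field and the 7-day window length are arbitrary choices made here for illustration.

```
index=sos source="nfs-iostat_sos.py" ((earliest=-7d@d latest=@d) OR (earliest=-97d@d latest=-90d@d))
| eval period=if(_time >= relative_time(now(), "-7d@d"), "last_7d", "90d_ago")
| chart avg(rtt_per_op) over op_type by period
```

Each row is an op type, with one column per period, so a shift in average RTT between the two windows shows up side by side.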
Good information. Thank you. Now, what is considered good/normal vs. bad? GETATTR skews the information: your first query returns 1.9M for sum(op_count) on GETATTR vs. 190K for LOOKUP (the next highest).