Solved: SoS 3.0 nfsiostat

rmorlen · 05-21-2013

I see a new input in SoS 3.0:

NEW DATA INPUT! - Scripted input 'nfs-iostat_sos.py' is now available to monitor the I/O usage of pooled search-heads on the shared NFS device.

I don't see any dashboards that use the data collected by the script. Any plans for this?

hexx · 06-07-2013

Absolutely! In a future release (hopefully, the next one), we plan to ship a view using the events generated by this scripted input to show the I/O bandwidth usage and the performance and responsiveness of the shared NFS device in a search-head pool.

In the meantime, you can always run manual searches against the data collected by nfs-iostat_sos.py to that end.

Here are a couple of simple examples:

Median IOPS and total OP count per OP type for the past 5 minutes:

index=sos source="nfs-iostat_sos.py" earliest=-5m | stats sum(op_count) median(ops_per_sec) by op_type

Worst-case round-trip time (RTT) for GETATTR, LOOKUP, ACCESS calls:

index=sos source="nfs-iostat_sos.py" (op_type=GETATTR OR op_type=LOOKUP OR op_type=ACCESS) | timechart max(rtt_per_op) by op_type
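
A variation on the second search, as a rough sketch only: a single slow call can dominate max(), so a high percentile may give a steadier picture of RTT. This assumes the events carry the same op_type and rtt_per_op fields used above; the 5-minute span and the 95th percentile are arbitrary choices for illustration, not recommended values.

```
index=sos source="nfs-iostat_sos.py" (op_type=GETATTR OR op_type=LOOKUP OR op_type=ACCESS) | timechart span=5m perc95(rtt_per_op) by op_type
```

Swapping perc95 for median or avg, or changing the span, works the same way.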

rmorlen · 06-11-2013

Or I guess my real question is: what is the best way to track this over a long period of time (say, 90 days), so that we can tell whether things today look similar to how they did 90 days ago? Something like:

index=sos source="nfs-iostat_sos.py" op_type=getattr earliest=-90d | timechart span=1d avg(kBps) by host

rmorlen · 06-11-2013

Good information, thank you. Now, what is considered good/normal vs. bad? GETATTR skews the numbers: your first query returns a sum(op_count) of 1.9M for GETATTR vs. 190K for LOOKUP (the next highest).
