Monitoring Splunk

Splunk Windows directory monitoring results in 30x "logfile size" to "data indexed" volume

vngzs
Engager

Summary

When monitoring a Windows directory, Splunk reports roughly 30 times more indexed volume than the actual size of the files.

Details

I have a directory indexed by Splunk.

PS C:\logs> cat 'C:\Program Files\SplunkUniversalForwarder\etc\apps\search\local\inputs.conf'
[monitor://C:\logs]
disabled = false

This has only logfiles like the following - going back a month:

PS C:\logs> get-childItem . | select Name,FileSize,length

Name                            FileSize                     Length
----                            --------                     ------
x-log-2018-07-01.H00.txt       297 B                           297

        [snip]

x-log-2018-07-30.H03.txt       {22.54 MB , 23,081.98 KB } 23635943
x-log-2018-07-30.H04.txt       {45.31 MB , 46,398.05 KB } 47511605
x-log-2018-07-30.H05.txt       {47.31 MB , 48,448.86 KB } 49611636
x-log-2018-07-30.H06.txt       {31.69 MB , 32,454.59 KB } 33233498
x-log-2018-07-30.H07.txt       {21.16 MB , 21,670.09 KB } 22190177
x-log-2018-07-30.H08.txt       {12.32 MB , 12,620.59 KB } 12923489
x-log-2018-07-30.H09.txt       {9.03 MB , 9,245.70 KB }    9467595
x-log-2018-07-30.H10.txt       {15.87 MB , 16,254.70 KB } 16644816
x-log-2018-07-30.H11.txt       {48.31 MB , 49,470.80 KB } 50658101
x-log-2018-07-30.H12.txt       {5.46 MB , 5,595.05 KB }    5729335
x-log-2018-07-30.H13.txt       {37.36 MB , 38,260.37 KB } 39178621
x-log-2018-07-30.H14.txt       {34.75 MB , 35,584.42 KB } 36438450
x-log-2018-07-30.H15.txt       {13.91 MB , 14,244.40 KB } 14586261
x-log-2018-07-30.H16.txt       {12.41 MB , 12,703.72 KB } 13008605
x-log-2018-07-30.H17.txt       {8.41 MB , 8,611.08 KB }    8817743
x-log-2018-07-30.H18.txt       {6.43 MB , 6,588.22 KB }    6746340
x-log-2018-07-30.H19.txt       {24.83 MB , 25,424.69 KB } 26034884
x-log-2018-07-30.H20.txt       {24.60 MB , 25,194.88 KB } 25799554
x-log-2018-07-30.H21.txt       {48.48 MB , 49,643.52 KB } 50834964

A new file is created every hour, and logs are appended to it. So we have 24 files per day.

This understandably resulted in a very high initial index volume when we added the directory. OK, fine. But when I view the Data Volume Calculator - Max Sources, I find that Splunk sees each of these logfiles as ~700 MiB in size, when they are only ~20 MiB each. This results in ~15 GiB/day license usage - from just this host! Multiplied across our whole fleet of Windows machines with this logging pattern, we would need to purchase a license far more expensive than necessary for the actual indexed data volume.

That is a daily license usage equivalent to the entire month's worth of logs in the directory!

PS C:\logs> "{0:N2} MB" -f ((Get-ChildItem -Recurse | Measure-Object -Property Length -Sum -ErrorAction Stop).Sum / 1MB)

14,203.00 MB
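As a rough sanity check on the figures above, the arithmetic can be sketched in Python. The inputs are the approximate values quoted in this post (~700 MiB reported per file, ~20 MiB on disk, 24 files per day), not exact measurements, and the variable names are only illustrative.

```python
# Approximate figures quoted in the post (not exact measurements).
REPORTED_MIB_PER_FILE = 700   # per-source size shown by the Data Volume Calculator
ACTUAL_MIB_PER_FILE = 20      # typical on-disk size of one hourly logfile
FILES_PER_DAY = 24            # one new file is created every hour

# How much larger the reported size is than the file on disk.
inflation = REPORTED_MIB_PER_FILE / ACTUAL_MIB_PER_FILE

# Daily volume in GiB for the reported and the expected case.
reported_gib_per_day = REPORTED_MIB_PER_FILE * FILES_PER_DAY / 1024
expected_gib_per_day = ACTUAL_MIB_PER_FILE * FILES_PER_DAY / 1024

print(f"inflation factor: {inflation:.0f}x")
print(f"reported volume:  {reported_gib_per_day:.1f} GiB/day")
print(f"expected volume:  {expected_gib_per_day:.2f} GiB/day")
```

With these rounded inputs the factor comes out at about 35x and roughly 16 GiB/day, which is in the same range as the ~30x and ~15 GiB/day observed, and close to the size of the whole directory.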

Question

Is there anything obvious I am missing here? Is there any debugging I can do to investigate this? I am currently locked out of search, but I can view the Splunk _internal index.

Edit: Minor update: I've regained full search capabilities through contact with Splunk Inc as we sort this out.

soumyasaha25
Contributor

Are you sure Splunk is not reindexing the log files? Also, are these log files rotated into a zip file in the same directory?

vngzs
Engager

The log files are being "rotated". However, this happens by creating a new .*\.H[0-9]{2}.txt (regex) file each hour. The LastWriteTime does not change for old logs throughout the day. For example, 16 hours have passed today, so the logs since midnight look like:

 PS C:\logs> ls


     Directory: C:\logs


 Mode                LastWriteTime         Length Name
 ----                -------------         ------ ----
 -a----         8/1/2018   1:00 AM       26936262 x-log-2018-08-01.H00.txt
 -a----         8/1/2018   2:00 AM       50616478 x-log-2018-08-01.H01.txt
 -a----         8/1/2018   3:00 AM       28502533 x-log-2018-08-01.H02.txt
 -a----         8/1/2018   4:00 AM       34602771 x-log-2018-08-01.H03.txt
 -a----         8/1/2018   5:00 AM       25583212 x-log-2018-08-01.H04.txt
 -a----         8/1/2018   6:00 AM       14175663 x-log-2018-08-01.H05.txt
 -a----         8/1/2018   7:00 AM        4641521 x-log-2018-08-01.H06.txt
 -a----         8/1/2018   8:00 AM       36055588 x-log-2018-08-01.H07.txt
 -a----         8/1/2018   8:59 AM       17373634 x-log-2018-08-01.H08.txt
 -a----         8/1/2018  10:00 AM       37514160 x-log-2018-08-01.H09.txt
 -a----         8/1/2018  10:59 AM        4530699 x-log-2018-08-01.H10.txt
 -a----         8/1/2018  12:00 PM       26898780 x-log-2018-08-01.H11.txt
 -a----         8/1/2018   1:00 PM        7231590 x-log-2018-08-01.H12.txt
 -a----         8/1/2018   2:00 PM       48213412 x-log-2018-08-01.H13.txt
 -a----         8/1/2018   3:00 PM        4970614 x-log-2018-08-01.H14.txt
 -a----         8/1/2018   4:00 PM       48001484 x-log-2018-08-01.H15.txt
 -a----         8/1/2018   4:34 PM       22850024 x-log-2018-08-01.H16.txt

The total volume does, however, appear to be consistent with Splunk indexing the entire directory each time a new log is created in the C:\logs directory. I am not sure how to make it only index files in that directory that it has not already indexed - I would expect it to only index new files, but perhaps each time a new file is created, Splunk is reindexing the entire directory regardless of which files are new.

sudosplunk
Motivator

Hello,

Can you provide the output of the search below? Run it over the last 10 days to see what the license usage trend looks like.
index=_internal source=*license_usage.log type=Usage
| search s=
| eval MB=round(b/1024/1024,2)
| timechart span=1d sum(MB) by s

vngzs
Engager

I narrowed the search to just the host used in the example here, and filtered out the logs by doing

index=_internal source=*license_usage.log type=Usage | search h="REDACTED" | rex field=s ".*-log-2018-[0-9]+-[0-9]+.(?<lf>.*).txt" | eval MB=round(b/1024/1024,2) | timechart span=1d sum(MB) by lf

I believe this query expresses your intent behind the search posted.

Here is the output:
[Image: logfile size by date]

(the H[0-9]{2} part of the log enumerates the rotation)

We see a pretty consistent trend of ~700 MiB per file. As you can see in the directory listing from my first post, however, the log files are only ~20 MiB in size.
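One detail of the query above that may be worth double-checking: the rex pattern captures only the hour suffix, so files from different days that share a suffix land in the same timechart series. Below is a small Python sketch of the same pattern. Python spells the named group `(?P<lf>...)`, the helper name `hour_bucket` is hypothetical, and the sample paths simply follow the naming shown in this thread.

```python
import re

# Same pattern as the rex in the query, using Python's named-group syntax.
PATTERN = re.compile(r".*-log-2018-[0-9]+-[0-9]+.(?P<lf>.*).txt")

def hour_bucket(source):
    """Return the value the pattern captures as `lf`, or None if there is no match."""
    match = PATTERN.search(source)
    return match.group("lf") if match else None

# Two files written on different days during the same hour slot.
sources = [
    r"C:\logs\x-log-2018-07-30.H18.txt",
    r"C:\logs\x-log-2018-08-01.H18.txt",
]
buckets = [hour_bucket(s) for s in sources]
print(buckets)
```

Both paths map to the same value, so each series in the chart may be the sum of that hour's file across every date that was indexed on a given day, rather than the size of a single file. Splitting by the full `s` value instead would show usage per file.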

sudosplunk
Motivator

Just making sure: you don't have crcSalt = <SOURCE> defined in your inputs.conf, right? I ask because this setting forces the input to consume files that have matching CRCs (cyclic redundancy checks), which could lead to a log file being re-indexed after it has rolled.

You can check this with btool. Go to \SplunkUniversalForwarder\bin and run:
splunk cmd btool inputs list --debug

One more thing to check: for your license usage query, try expanding the time range back to when you first started collecting these logs. This will tell us whether Splunk has been doing this from the start or whether something changed over time.
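The crcSalt behavior mentioned above can be illustrated with a toy model in Python. This is only a sketch of the idea, not Splunk's actual implementation: it assumes the check is a CRC over the first 256 bytes of the file (mirroring the documented initCrcLength default) and that the salt simply prepends the source path. `zlib.crc32`, the helper `file_fingerprint`, and the sample content are stand-ins chosen for illustration.

```python
import zlib

INIT_CRC_LENGTH = 256  # assumption: mirrors the documented initCrcLength default

def file_fingerprint(head, source=None):
    """Toy fingerprint: CRC32 of the first bytes, optionally salted with the path."""
    data = head[:INIT_CRC_LENGTH]
    if source is not None:
        # Rough equivalent of crcSalt = <SOURCE>: mix the path into the checksum.
        data = source.encode("utf-8") + data
    return zlib.crc32(data)

# Two hypothetical files that begin with identical content.
head = b"2018-07-30 00:00:00 service started\n" * 10
path_a = r"C:\logs\x-log-2018-07-30.H03.txt"
path_b = r"C:\logs\x-log-2018-07-30.H04.txt"

unsalted_match = file_fingerprint(head) == file_fingerprint(head)
salted_match = file_fingerprint(head, path_a) == file_fingerprint(head, path_b)
print(unsalted_match, salted_match)
```

Without the salt the two files look identical to the check. With the path mixed in they are treated as distinct, which is why the setting can cause a renamed or rolled file to be read again.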

vngzs
Engager

The output for C:\logs from the command above gives me no crcSalt:

C:\Program Files\SplunkUniversalForwarder\etc\apps\search\local\inputs.conf                     [monitor://C:\logs]
C:\Program Files\SplunkUniversalForwarder\etc\system\default\inputs.conf                        _rcvbuf = 1572864
C:\Program Files\SplunkUniversalForwarder\etc\apps\search\local\inputs.conf                     disabled = false
C:\Program Files\SplunkUniversalForwarder\etc\system\default\inputs.conf                        evt_dc_name =
C:\Program Files\SplunkUniversalForwarder\etc\system\default\inputs.conf                        evt_dns_name =
C:\Program Files\SplunkUniversalForwarder\etc\system\default\inputs.conf                        evt_resolve_ad_obj = 0
C:\Program Files\SplunkUniversalForwarder\etc\system\local\inputs.conf                          host = REDACTED
C:\Program Files\SplunkUniversalForwarder\etc\system\default\inputs.conf                        index = default

In the image I posted, the query time is actually the full time range since our Splunk install (it's less than a week old!). 2018-07-27 was the first day we started indexing the directory. Interestingly enough, only on that first day did we get the correct filesizes - each subsequent day has resulted in too much index usage.

sudosplunk
Motivator

Thank you. One more question: what is the sourcetype for these logs? I don't see any sourcetype definition in your inputs.conf above.

vngzs
Engager

OK. sourcetype is not defined, and it defaults to sourcetype = x-log-2018-07-10.H with source = C:\logs\x-log-2018-08-01.H18.txt.

sudosplunk
Motivator

Do all your log files have the same sourcetype = x-log-2018-07-10.H? Are you sure the sourcetype is not changing?

vngzs
Engager

OK, I did a longer search. Sometimes they do change: certain logs actually have sourcetype = x-log-2018-08-01.H-too_small. Others have x-log-2018-07-27.H as their sourcetype. I'm not sure of the implications of this. Would it be better to define a custom source type for these logs in my inputs.conf?

sudosplunk
Motivator

Yes. It is best practice to define the sourcetype explicitly (in inputs.conf) before ingesting data.
I am not 100% sure, but I think this could be one of the reasons for the reindexing. To check, compare license usage by source and by sourcetype rather than by host.

PS: One more question: how are you doing event breaking and timestamp extraction for your data? Are these settings defined at the source level or the host level?
