Splunk Search

What exactly is a tsidx file?

aoliullah
Path Finder

What exactly is a tsidx file? Can someone explain, please? I don't quite understand the definition:

"A tsidx file associates each unique keyword in your data with location references to events(??), which are stored in a companion rawdata file"

I ask this in relation to the tstats command, whose documentation states: "Use the tstats command to perform statistical queries on indexed fields in tsidx files".

Can someone explain this in context to tstats?

1 Solution

s2_splunk
Splunk Employee

tsidx (time series index) files are created as part of the indexing pipeline. The incoming data is parsed into terms (think 'words' delimited by certain characters), and each term is stored along with an offset (a number) that represents the location in the rawdata file (journal.gz) where the event data is written.
It works just like the index in a book, except that it is a complete index rather than a subset. If every word in a book were in the index, the index would be far larger than the book itself, and that is exactly what happens in Splunk. If you look at an index bucket directory on disk, you will find that the index and other metadata files together often exceed the size of the compressed raw data.
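
To make the book-index analogy concrete, here is a minimal Python sketch of an inverted index. It is a toy model, not Splunk's actual on-disk format: the build_index and lookup names, the whitespace-only tokenizer, and the in-memory list standing in for journal.gz are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(events):
    """Map each unique term to the offsets of the events that contain it."""
    index = defaultdict(list)
    for offset, event in enumerate(events):
        # Toy tokenizer: lowercase and split on whitespace only.
        # Splunk's real segmentation rules are more involved.
        for term in set(event.lower().split()):
            index[term].append(offset)
    return dict(index)

def lookup(index, rawdata, term):
    """Find a term in the index, then fetch only the matching events."""
    return [rawdata[offset] for offset in index.get(term.lower(), [])]

# Stand-in for the rawdata journal: one event per list position.
rawdata = [
    "ERROR disk full on host1",
    "INFO login ok on host2",
    "ERROR timeout on host2",
]
index = build_index(rawdata)
matches = lookup(index, rawdata, "error")
```

The point of the sketch is that lookup never scans events that do not contain the term; it goes from the term to the offsets and reads only those events.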

Searches using tstats read only the tsidx files. Splunk does not have to read, decompress, and search the journal.gz files to produce the results, which can be orders of magnitude faster.

Try it for yourself! The following two searches are semantically identical and should return exactly the same results on your Splunk instance. Pick "Previous week" from the time range picker, then compare how long each one takes in the Job Inspector once they are complete.

index=_internal | stats count by sourcetype

Equivalent tstats search:

| tstats count where index=_internal by sourcetype 

In my environment, the first search takes 115s; the tstats search completes in 4s.

Note that this only works for indexed fields, not for fields extracted at search time. By default, the indexed fields are _time, source, host, and sourcetype.
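
A rough way to picture why this restriction exists: indexed fields are written into the tsidx lexicon as field::value terms, so a count by sourcetype can be answered from the index alone. The Python sketch below is a toy model under that assumption; the lexicon dictionary and the count_by helper are hypothetical and not Splunk internals.

```python
def count_by(lexicon, field):
    """Count events per value of an indexed field using only the lexicon."""
    prefix = field + "::"
    return {
        term[len(prefix):]: len(offsets)
        for term, offsets in lexicon.items()
        if term.startswith(prefix)
    }

# Toy lexicon: term -> offsets of the events containing it.
lexicon = {
    "sourcetype::splunkd": [0, 1, 3],
    "sourcetype::splunkd_access": [2],
    "host::idx01": [0, 1, 2, 3],
    "error": [1, 3],
}

counts = count_by(lexicon, "sourcetype")
# A search-time field has no field::value terms, so nothing comes back.
missing = count_by(lexicon, "status")
```

A field extracted at search time only exists after the raw event has been read and parsed, which is precisely the step that tstats skips.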

Hope that makes sense.
BTW, you can use the walklex command to take a look at what's in a given tsidx file.

aaraneta_splunk
Splunk Employee

@aoliullah - Did one of the answers below help clarify what a tsidx file is? If yes, please click “Accept” below the best answer to resolve this post and upvote anything that was helpful. If no, please leave a comment with more feedback. Thanks.

jplumsdaine22
Influencer

There was a great talk at .conf2016 related to this. The slides are here: https://conf.splunk.com/files/2016/slides/fields-indexed-tokens-and-you.pdf

DalJeanis
SplunkTrust

The idx part stands for "index" and the ts part for "time series", but the whole thing is generally synonymous with "index file".

http://docs.splunk.com/Splexicon:Indexfiles

An index file contains keys, and pointers to data.

If an index file exists for the fields in the data that you are looking for, then you can use the tstats command to gather the information that is accessible through that index. If no index file exists for that data, then tstats won't work.

So, for example, let's suppose that you have your system set up, for a particular index and sourcetype, to index the source IP address into a field called src_ip. Let's suppose you want a quick count of all the traffic on a particular day from a series of IP addresses 123.123.123.1-50. Since you have an index on that field, you can use tstats in summary mode instead of stats, which will be MUCH more efficient.
