Getting Data In

When to create an additional sourcetype vs new indexed fields when events are in JSON format?

jagadeeshm
Contributor

We have an application that reads events from Kafka using Kafka consumers and persists them into a database (MySQL/Oracle). Each event has a table name field that tells the consumer which table to persist the event into. The event itself can be deserialized into a JSON string.

I am working on a process to ingest these events into Splunk via HEC.

Because the events are in JSON format, I understand that when they are ingested into Splunk, the sourcetype is tagged as _json.

Because we have millions/billions of these events, the sourcetype field, which is a default field in Splunk, goes underutilized when the data is indexed.

Reading through the documentation -
http://docs.splunk.com/Documentation/Splunk/6.4.3/Data/Aboutdefaultfields
http://docs.splunk.com/Documentation/Splunk/6.4.3/Data/Configureindex-timefieldextraction

I understand we have at least two ways to deal with this scenario.

1 - Create new sourcetypes, where the name of the sourcetype is the name of the table and the underlying definition is the same as the _json sourcetype. Looks easy, but we could end up with hundreds of such sourcetypes.

2 - Create an additional indexed field, let's say "tablename". Looks easy, but we would probably end up needing additional index space because this field is extracted at index time.

Any suggestions on which is the better approach?

WalterBoyd
New Member

You can use JSON to configure multiple Sources in either of the following ways:
Create a single JSON file with the configuration information for all the Sources (sources.json).
Create individual JSON files, one for each Source, and then combine them in a single folder. You then configure the Source folder instead of the individual Sources.


Jeremiah
Motivator

Be aware that in 6.4 there are two different HEC endpoints you can write to.

The /services/collector endpoint does not pass events through the event processing pipeline. This means index-time processing of sourcetypes won't work here. So you actually don't want to use _json as the sourcetype, because the _json sourcetype extracts json events at index time. You'll notice in the _json definition that INDEXED_EXTRACTIONS = json and KV_MODE = none. What that does is tell Splunk to create your json fields at index time, and skip auto-extracting the fields at search time. Otherwise, you'd end up with two entries for each field (Splunk would show the index-time and the search-time field).
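For reference, the relevant lines of the shipped _json definition in props.conf are essentially the following, and a per-table sourcetype for option 1 would simply repeat them under a new stanza name (the orders_json name below is just a placeholder):

    # props.conf -- relevant lines of the built-in _json sourcetype
    [_json]
    INDEXED_EXTRACTIONS = json
    KV_MODE = none

    # props.conf -- a hypothetical per-table clone for option 1
    [orders_json]
    INDEXED_EXTRACTIONS = json
    KV_MODE = none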

The new /services/collector/raw endpoint, however, will pass data through the event processing pipeline. So you can post json data as _json and use index-time field extractions, transforms, and so on. Hopefully this difference makes sense.

http://dev.splunk.com/view/event-collector/SP-CAAAE8Y

As far as whether to use one sourcetype or multiple, or a new field: are you putting something in the sourcetype field just because you feel like you need to utilize it? If so, you may want to hold off on that field until you've used Splunk for a while and can see how best to use it for your data. You'll find differing opinions, but I think the sourcetype field should be used to describe the format of your data. Remember you can also use the source field to include information that might better describe where the data originated from (i.e., the tablename). And you can create props/extractions that apply to sources as well.
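For example, a props.conf stanza keyed on source could carry the extraction (the source value and the setting here are only an illustration):

    # props.conf -- applies to events whose source was set to a table name
    [source::orders]
    KV_MODE = json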


jagadeeshm
Contributor

@jeremiah - Thanks for the details on the REST endpoints. I want to use the sourcetype because it is one of the default fields in Splunk, and some of the best practices I have read advise using this field in queries for better performance. If I had the table name as my sourcetype, it would be a better filter for me. But which of the above two options is the better choice?


jagadeeshm
Contributor

Also, you mentioned - "Remember you can also use the source field to include information that might better describe where the data originated from (i.e., the tablename)." How can we include the table name in the source field?


Jeremiah
Motivator

There are two ways you can set the source field, depending on which HEC endpoint you use. If you choose the /services/collector endpoint, then you can set the field when you send the event (the time value is epoch time):

{
    "time": 1426279439,
    "host": "localhost",
    "source": "datasource",
    "sourcetype": "txt",
    "index": "main",
    "event": "Hello world!"
}

http://dev.splunk.com/view/event-collector/SP-CAAAE6P
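For example, a payload like the one above could be posted with curl (the host, port, and token are placeholders, and the source value carries a hypothetical table name):

    curl -k https://splunk.example.com:8088/services/collector \
      -H "Authorization: Splunk 00000000-0000-0000-0000-000000000000" \
      -d '{"time": 1426279439, "host": "localhost", "source": "orders", "sourcetype": "txt", "index": "main", "event": "Hello world!"}'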

If you'd rather use the raw endpoint, then you can use an index-time transform to rewrite the value of source from your data. Something like what you see in the link below, but with source instead of host:

http://docs.splunk.com/Documentation/Splunk/4.3.1/Data/Overridedefaulthostassignments
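A sketch of that approach, assuming the raw JSON contains a "tablename" key and the events come in with a sourcetype named my_kafka_json (both names are assumptions):

    # transforms.conf -- pull the table name out of the raw event and write it to source
    [set_source_from_tablename]
    REGEX = "tablename"\s*:\s*"([^"]+)"
    DEST_KEY = MetaData:Source
    FORMAT = source::$1

    # props.conf -- apply the transform at index time to the incoming sourcetype
    [my_kafka_json]
    TRANSFORMS-set_source = set_source_from_tablename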
