Hi,
I'm attempting to deal with data coming from a query run by the Splunk DB Connector. It pulls all the data in fine, but one part of its ingestion is really getting me: one of the fields is an XML document that contains double-quote (") characters. Splunk seems to capture the data in that field only up to the first double quote, then treats that as the end of the value. How can I get the whole XML document as it is?
For example:
XML data:
<ShowPlanXML xmlns="http://schemas.example.com" Version="1.2" Build="11.0.6615.2"><BatchSequence><Batch><Statements>Much much much more content here</Statements></Batch></BatchSequence></ShowPlanXML>
However, when this makes it to Splunk the field name is there but the value stops right after the equals symbol:
query_plan = <ShowPlanXML xmlns=
I know all the data is in Splunk, as I can see it in the raw event; it's just not being properly captured into the field.
I assume the solution is to tell Splunk to escape the double quotes as the data comes in. But where/how do I do this?
Thanks
Eddie
Well, after a bit more research I came up with a fix for this. Possibly this is basic Splunk 101 - if I'd known what a "search time field extraction" was beforehand, I would probably have fixed it immediately. Anyway, here is my solution and the process I used to get to it.
Step 1 - Work out how to match the data with a regex
I knew the data was coming in from the DB query but getting lost when Splunk split it into keys and values, because of the quotes in the data. So I took a look at the raw data to work out how to match it. To do this, write a Splunk search that brings up the data, then pipe the _raw version of each event into a table to show exactly what the data looks like:
index=myindex sourcetype=SQLEXTRACT source=mysource | table _raw
From this I could determine what regex would match the full value. In the raw data the field and value looked like this:
query_plan="<ShowPlanXML xmlns="http://schemas.example.com" Version="1.2" Build="11.0.6615.2"><BatchSequence><Batch><Statements>Much much much more content here</Statements></Batch></BatchSequence></ShowPlanXML>"
Using the Splunk search interface and the rex command I was able to develop a regex to match the XML data in the value above. Here is an example of my finished and working search with regex:
index=myindex sourcetype=SQLEXTRACT source=mysource | rex field=_raw "query_plan=\"(?<query_plan_xml>.+<\/ShowPlanXML>)" | table query_plan_xml
Note the pipe into the table command after the rex command, which gets Splunk to display the value the regex has just captured.
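If it helps to test the pattern outside Splunk first, the same idea can be sketched in plain Python with the re module (note that Python named groups are written (?P&lt;name&gt;...), whereas Splunk's rex accepts (?&lt;name&gt;...)). The sample raw line below is the one shown above, shortened:

```python
import re

# Sample _raw event, as shown above (content shortened)
raw = ('query_plan="<ShowPlanXML xmlns="http://schemas.example.com" '
       'Version="1.2" Build="11.0.6615.2"><BatchSequence><Batch>'
       '<Statements>Much much much more content here</Statements>'
       '</Batch></BatchSequence></ShowPlanXML>"')

# Same approach as the rex command: skip the opening quote, then
# greedily capture everything up to the closing </ShowPlanXML> tag,
# so the embedded double quotes are swallowed by the .+
pattern = r'query_plan="(?P<query_plan_xml>.+</ShowPlanXML>)'

match = re.search(pattern, raw)
print(match.group("query_plan_xml"))
```

Because the capture runs all the way to the closing tag, the embedded quotes no longer terminate the value early.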
Step 2 - How to make this a Search Time Field Extraction
This second part took me ages to get working. It turned out the configuration needs to be made on the search head, which in hindsight seems obvious, because it is a search-time extraction.
Simple extractions like this can just be added to the props.conf file on your search head (most probably in %splunk%/etc/apps/search/local). In props.conf you add a new stanza (the part enclosed in square brackets) named to match the sourcetype of your data - in this example 'SQLEXTRACT' - and then use the 'EXTRACT-' keyword to tell Splunk to use a regex to extract the data. The syntax is EXTRACT-classname = regular expression with a named capture group.
So in the props.conf on my search head I added the following at the end of the file:
[SQLEXTRACT]
EXTRACT-queryplanXML_class = query_plan=\"(?<query_plan>.+<\/ShowPlanXML>)
This stanza simply tells Splunk that when it is searching data with a sourcetype of 'SQLEXTRACT' it should apply this extraction. The extraction then matches the XML data and returns it with a field name of 'query_plan', which overwrites the existing incorrectly matched value. I'm not really sure what the class name part does or where else it might be used - I just picked something unlikely to be used anywhere else.
Once you have updated the props.conf file on your search head, make sure Splunk re-reads it. You can do this by restarting Splunk, but because this is a search-time configuration you can also get Splunk to re-read props.conf by running the following search:
| extract reload=t
This final command won't give you any results, but that's okay - Splunk will reload the props.conf file. Re-run your original search and you should find the previously truncated field now returns all the data as it should!
The workaround I used for this problem is to replace the double quotes with an empty string using SQL's REPLACE function. I only had to change the dbinput configuration.
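A sketch of what that dbinput query change might look like - the table and column names here are illustrative, not from the original post, and note that stripping the quotes does alter the stored XML (attribute values lose their quoting), trading fidelity for clean ingestion:

```sql
-- Hypothetical dbinput query: strip every double quote from the
-- query_plan column before DB Connect hands the row to Splunk.
SELECT id,
       REPLACE(query_plan, '"', '') AS query_plan
FROM   my_query_plans;
```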
@santosh_sshanbh yup. doing the same thing. DBConnect should address this issue. Free text fields should be able to get ingested and indexed "as is" without breaking the indexing.
Through further digging I've discovered that running the SQL query directly from DB Connect's SQL Explorer captures all the fields perfectly, including the query_plan field.
So the incorrect matching of quotes must be happening after the data has passed through the DB Connect app and into Splunk proper. It appears I'm hitting a known issue (feature/bug) with DB Connect: http://docs.splunk.com/Documentation/DBX/3.1.3/DeployDBX/Troubleshooting#Incomplete_field_values_are...
It seems to recommend using an extraction to fix this, but I'm stuck as to where to even begin - reading the extract documentation doesn't seem to give any way of matching this field as a whole either.
I've often had to escape the double-quotes inside a field by adding escape characters in SQL. Ugly, but seems to work well.
have you tried using xpath? http://docs.splunk.com/Documentation/Splunk/7.1.0/SearchReference/Xpath
It would be a search-time extraction, but i think it might do the trick.
I don't think that's exactly what I want, or at least I can't get it to work. I say this because the data comes into Splunk as a series of field names and values, each value enclosed in double quotes and separated by commas:
fieldname="value", (field name, equals, double quote, value, double quote, comma)
This means the XML part is just one large quoted value with a field name of query_plan, but Splunk ends that field early because it thinks it has the whole value as soon as it sees the first double quote inside the XML.
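A quick way to see the failure mode: a naive key="value" parser (a stand-in for the automatic extraction behaviour, not Splunk's actual implementation) treats the value as "everything up to the next double quote", so the first quote inside the XML ends the field:

```python
import re

# Simplified raw event with XML embedded in a quoted value
raw = ('query_plan="<ShowPlanXML xmlns="http://schemas.example.com" '
       'Version="1.2"><BatchSequence/></ShowPlanXML>"')

# Naive quoted key=value extraction: [^"]* stops at the first
# double quote, cutting the XML off right after xmlns=
kv = dict(re.findall(r'(\w+)="([^"]*)"', raw))
print(kv["query_plan"])  # -> <ShowPlanXML xmlns=
```

The truncated value matches exactly what shows up in the broken field, which is why a custom extraction anchored on the closing tag is needed.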