Hi,
I'm attempting to deal with data coming from a query run by the Splunk DB Connector. It pulls all the data in fine, but one part of its ingestion is really getting me: one of the fields is an XML document that contains double-quote (") characters. Splunk seems to capture the data in that field only up to the first double quote, then treats that as the end of the value. How can I get the whole XML document as it is?
For example:
XML data:
<ShowPlanXML xmlns="http://schemas.example.com" Version="1.2" Build="11.0.6615.2"><BatchSequence><Batch><Statements>Much much much more content here</Statements></Batch></BatchSequence></ShowPlanXML>
However, when this makes it to Splunk the field name is there but the value stops right after the equals symbol:
query_plan = <ShowPlanXML xmlns=
I know all the data is in Splunk, as I can see it in the raw event; it's just not being properly captured into the field.
I assume the solution is to tell Splunk to escape the double quotes as the data comes in. But where/how do I do this?
Thanks
Eddie
Well, after a bit more research I came up with a fix for this. Possibly this is basic Splunk 101 - if I'd known what a "search time field extraction" was beforehand, I would probably have fixed it immediately. Anyway, here is my solution and the process I used to get to it.
Step 1 - Work out how to match the data with a regex
I knew the data was coming in from the DB query but getting lost when Splunk split it into keys and values, because of the quotes in the data. So I took a look at the raw data to work out how to match it. To do this, write a Splunk search that brings up the data, then pipe the _raw version of each event into a table to show exactly what the data looks like:
index=myindex sourcetype=SQLEXTRACT source=mysource | table _raw
From this I could determine what regex would match the full value. In the raw data the field and value looked like this:
query_plan="<ShowPlanXML xmlns="http://schemas.example.com" Version="1.2" Build="11.0.6615.2"><BatchSequence><Batch><Statements>Much much much more content here</Statements></Batch></BatchSequence></ShowPlanXML>"
Using the Splunk search interface and the rex command I was able to develop a regex to match the XML data in the value above. Here is an example of my finished and working search with regex:
index=myindex sourcetype=SQLEXTRACT source=mysource | rex field=_raw "query_plan=\"(?<query_plan_xml>.+<\/ShowPlanXML>)" | table query_plan_xml
Note the pipe into the table command after the rex command, which gets Splunk to display the value the regex has just captured.
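If it helps to test the pattern outside Splunk first, the same idea can be sketched in plain Python with the re module (note that Python named groups are written (?P&lt;name&gt;...), whereas Splunk's rex accepts (?&lt;name&gt;...)). The sample raw line below is the one shown above, shortened:

```python
import re

# Sample _raw event, as shown above (content shortened)
raw = ('query_plan="<ShowPlanXML xmlns="http://schemas.example.com" '
       'Version="1.2" Build="11.0.6615.2"><BatchSequence><Batch>'
       '<Statements>Much much much more content here</Statements>'
       '</Batch></BatchSequence></ShowPlanXML>"')

# Same approach as the rex command: skip the opening quote, then
# greedily capture everything up to the closing </ShowPlanXML> tag,
# so the embedded double quotes are swallowed by the .+
pattern = r'query_plan="(?P<query_plan_xml>.+</ShowPlanXML>)'

match = re.search(pattern, raw)
print(match.group("query_plan_xml"))
```

Because the capture runs all the way to the closing tag, the embedded quotes no longer terminate the value early.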
Step 2 - How to make this a Search Time Field Extraction
This second part took me ages to get working. It turned out the configuration needs to be made on the search head, which in hindsight seems obvious, because it is a search-time extraction.
Simple extractions like this can just be added to the props.conf file on your search head (most probably in %splunk%/etc/apps/search/local). In props.conf you add a new stanza (the part enclosed in square brackets) named to match the sourcetype of your data - in this example 'SQLEXTRACT' - and then use the 'EXTRACT-' keyword to tell Splunk to use a regex to extract the data. The syntax is EXTRACT-classname = regular expression with a named capture group.
So in the props.conf on my search head I added the following at the end of the file:
[SQLEXTRACT]
EXTRACT-queryplanXML_class = query_plan=\"(?<query_plan>.+<\/ShowPlanXML>)
This stanza simply tells Splunk that when it is searching data with a sourcetype of 'SQLEXTRACT' it should apply this extraction. The extraction then matches the XML data and returns it with a field name of 'query_plan', which overwrites the existing incorrectly matched value. I'm not really sure what the class name part does or where else it might be used - I just picked something unlikely to be used anywhere else.
Once you have updated the props.conf file on your search head, make sure Splunk re-reads it. You can do this by restarting Splunk, but because this is a search-time configuration you can also get Splunk to re-read props.conf by running the following search:
| extract reload=t
This final command won't give you any results, but that's okay - Splunk will reload the props.conf file. Re-run your original search and you should find the previously truncated field now returns all the data as it should!
The workaround I used for this problem is to replace the double quotes with an empty string using SQL's REPLACE function. I only had to change the dbinput configuration.
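A sketch of what that dbinput query change might look like - the table and column names here are illustrative, not from the original post, and note that stripping the quotes does alter the stored XML (attribute values lose their quoting), trading fidelity for clean ingestion:

```sql
-- Hypothetical dbinput query: strip every double quote from the
-- query_plan column before DB Connect hands the row to Splunk.
SELECT id,
       REPLACE(query_plan, '"', '') AS query_plan
FROM   my_query_plans;
```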
@santosh_sshanbh yup. doing the same thing. DBConnect should address this issue. Free text fields should be able to get ingested and indexed "as is" without breaking the indexing.
Through further digging I've discovered that running the SQL query directly from DB Connect's SQL Explorer captures all the fields perfectly, including the query_plan field.
So the incorrect matching of quotes must be happening after the data has passed through the DB Connect app and into Splunk proper. It appears I'm hitting a known issue (feature/bug) with DB Connect: http://docs.splunk.com/Documentation/DBX/3.1.3/DeployDBX/Troubleshooting#Incomplete_field_values_are...
It seems to recommend using an extraction to fix this, but I'm stuck as to where to even begin - reading the extract documentation doesn't seem to give any way of matching this field as a whole either.
I've often had to escape the double-quotes inside a field by adding escape characters in SQL. Ugly, but seems to work well.
have you tried using xpath? http://docs.splunk.com/Documentation/Splunk/7.1.0/SearchReference/Xpath
It would be a search-time extraction, but i think it might do the trick.
I don't think that's exactly what I want, or at least I can't get it to work. I say this because the data comes into Splunk as a series of field names and values, each value enclosed in double quotes and separated by commas:
fieldname="value", (field name, equals, double quote, value, double quote, comma)
This means the XML part is just one large quoted value with a field name of query_plan, but Splunk ends that field early because it thinks it has the whole value as soon as it sees the first double quote inside the XML.
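A quick way to see the failure mode: a naive key="value" parser (a stand-in for the automatic extraction behaviour, not Splunk's actual implementation) treats the value as "everything up to the next double quote", so the first quote inside the XML ends the field:

```python
import re

# Simplified raw event with XML embedded in a quoted value
raw = ('query_plan="<ShowPlanXML xmlns="http://schemas.example.com" '
       'Version="1.2"><BatchSequence/></ShowPlanXML>"')

# Naive quoted key=value extraction: [^"]* stops at the first
# double quote, cutting the XML off right after xmlns=
kv = dict(re.findall(r'(\w+)="([^"]*)"', raw))
print(kv["query_plan"])  # -> <ShowPlanXML xmlns=
```

The truncated value matches exactly what shows up in the broken field, which is why a custom extraction anchored on the closing tag is needed.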