How to import pdf file in Splunk with automated script

snehalk · ‎03-22-2017

Hello All,

I have a requirement to monitor PDF files and import them into Splunk for searching.

But I found that we cannot import them directly, because PDF is a binary-encoded format. Is there any way to do this using configuration in props.conf or transforms.conf?

Or is there any script that will convert the PDF file to a text file for Splunk?

Thank you

s2_splunk · ‎03-22-2017

If you can find a PDF processor that can convert the PDF to a text file, you can use invalid_cause and unarchive_cmd settings to invoke that processor and send the resulting output to your indexer. There are multiple examples to be found here on Answers.

You will probably have to spend some time to configure proper line-breaking/merging so that the indexed documents are easily searchable, but in theory, that should work. How practical this is I am not sure. Splunk is really not designed for searching documents/LOBs.

snehalk · ‎03-23-2017

Hello Ssievert,

Thanks for the response. Can you please tell me which command can be used on Linux/Windows to extract the text from a PDF file for Splunk?

s2_splunk · ‎03-23-2017

The link in my response above is a Google search for available tools that you can use to convert PDF to text. You would need to find and install the tool of your choice on your forwarder and configure it in unarchive_cmd (see the second hyperlink in my response).
Splunk has no native capability to do the PDF to text conversion for you, but we allow you to plug in your own pre-processor via the invalid_cause/unarchive_cmd mechanism.
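
As a rough illustration of such a pre-processor, the sketch below reads the PDF bytes from stdin, hands them to an external converter, and writes the resulting text to stdout. It assumes Python 3 and that the poppler `pdftotext` utility is installed on the forwarder and available on PATH. The script name `pdf2text.py` and the function name `pdf_bytes_to_text` are hypothetical, and any converter that can print text to stdout could be swapped in.

```python
#!/usr/bin/env python3
"""pdf2text.py (hypothetical name): PDF on stdin -> plain text on stdout."""
import os
import subprocess
import sys
import tempfile


def pdf_bytes_to_text(pdf_bytes, converter=("pdftotext", "-layout")):
    """Write the PDF bytes to a temp file and run the converter on it.

    Assumes a converter invoked as `<cmd> <input.pdf> -` that prints the
    extracted text to stdout, which is how poppler's pdftotext behaves.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(pdf_bytes)
        # "-" as the output file tells pdftotext to write to stdout.
        return subprocess.check_output(list(converter) + [path, "-"])
    finally:
        os.remove(path)


def main():
    # Use the binary buffers so the PDF bytes are not decoded as text.
    data = sys.stdin.buffer.read()
    if data:  # nothing to convert for empty input
        sys.stdout.buffer.write(pdf_bytes_to_text(data))


if __name__ == "__main__":
    main()
```

The temp file is only there because the converter is given a file path; a converter that reads from stdin directly would not need it.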

snehalk · ‎03-24-2017

Hello Ssievert,
I have written a small script that replaces the contents of a file, just to check how unarchive_cmd works, but I am not getting any output.

Below is my configuration:

inputs.conf

[monitor://C:\\mytest\*]
disabled = false
index = main
sourcetype = sampledata

props.conf

[sampledata]
NO_BINARY_CHECK = true
invalid_cause = archive
unarchive_cmd = python C:\Program Files\Splunk\etc\apps\testapp\bin\retxt.py

retxt.py

f1 = open('C:\\mytest\\1.txt', 'r')
f2 = open('C:\\mytest\\2.txt', 'w')
for line in f1:
    f2.write(line.replace('xyz', 'abcd'))
f1.close()
f2.close()

Note: if I run the Python script independently, it works fine. Can you please let me know where I am going wrong?
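
One thing worth checking here: as far as the props.conf documentation describes it, unarchive_cmd must be a shell command that takes its input on stdin and produces its output on stdout, and the setting is only valid in a [source::...] stanza. The script above reads and writes fixed files instead, so Splunk would receive nothing from it. A minimal sketch of the same replacement written as a stdin-to-stdout filter follows; the function name `replace_stream` is illustrative and not from the thread.

```python
import sys


def replace_stream(lines, old="xyz", new="abcd"):
    """Yield each input line with `old` replaced by `new`."""
    for line in lines:
        yield line.replace(old, new)


if __name__ == "__main__":
    # Assumption: Splunk pipes the monitored file's content to stdin and
    # indexes whatever this command prints to stdout.
    for out_line in replace_stream(sys.stdin):
        sys.stdout.write(out_line)
```

Because the script path in unarchive_cmd contains a space (Program Files), it would probably also need to be quoted for the command to resolve.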

