How to import PDF files into Splunk with an automated script

snehalk
Communicator

Hello All,

I have a requirement to monitor PDF files and import them into Splunk for searching.

However, I found that we cannot import them directly because PDF is a binary-encoded format. Is there any way to do this through configuration in props.conf or transforms.conf?

Or is there a script that can convert the PDF file to a text file for Splunk?

Thank you

s2_splunk
Splunk Employee

If you can find a PDF processor that can convert the PDF to a text file, you can use invalid_cause and unarchive_cmd settings to invoke that processor and send the resulting output to your indexer. There are multiple examples to be found here on Answers.

You will probably have to spend some time to configure proper line-breaking/merging so that the indexed documents are easily searchable, but in theory, that should work. How practical this is I am not sure. Splunk is really not designed for searching documents/LOBs.

snehalk
Communicator

Hello Ssievert,

Thanks for the response. Can you please tell me which command can be used on Linux/Windows to extract the text from a PDF file for Splunk?

s2_splunk
Splunk Employee

The link in my response above is a Google search for available tools that you can use to convert PDF to text. You would need to find and install the tool of your choice on your forwarder and configure it in unarchive_cmd (see the second hyperlink in my response).
Splunk has no native capability to do the PDF to text conversion for you, but we allow you to plug in your own pre-processor via the invalid_cause/unarchive_cmd mechanism.
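
To illustrate the shape such a pre-processor might take: as far as the props.conf spec describes it, the unarchive_cmd command receives the file on stdin and must write its output to stdout. The sketch below is a minimal Python filter along those lines. The names extract_printable_text and run_filter are hypothetical, and the extraction step is only a crude stand-in (it keeps runs of printable ASCII, much like the Unix strings tool, and will not recover text from compressed PDF streams). A real setup would swap in a call to whichever PDF-to-text tool you install.

```python
#!/usr/bin/env python3
"""Sketch of a stdin -> stdout pre-processor for unarchive_cmd (hypothetical names)."""
import re
import sys

# Runs of at least four printable ASCII bytes, similar to the Unix `strings` tool.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_printable_text(raw: bytes) -> bytes:
    """Stand-in converter: keep printable runs, one per line.

    Assumption: replace this with a call to a real PDF-to-text tool.
    """
    runs = PRINTABLE_RUN.findall(raw)
    return b"\n".join(runs) + b"\n" if runs else b""


def run_filter(src, dst, convert=extract_printable_text) -> None:
    """Read all bytes from src, convert them, and write the result to dst."""
    dst.write(convert(src.read()))
    dst.flush()


if __name__ == "__main__":
    # Binary streams: the input is not text, so avoid implicit decoding.
    run_filter(sys.stdin.buffer, sys.stdout.buffer)
```

A filter like this can be tried outside Splunk first by redirecting a sample file into it and checking what appears on stdout, before wiring it into unarchive_cmd.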

0 Karma

snehalk
Communicator

Hello Ssievert,
I have written a small script that replaces the contents of a file, just to check how unarchive_cmd works, but I am not getting any output.

Below is my configuration:

inputs.conf

[monitor://C:\\mytest\*]
disabled = false
index = main
sourcetype = sampledata

props.conf

[sampledata]
NO_BINARY_CHECK = true
invalid_cause = archive
unarchive_cmd = python C:\Program Files\Splunk\etc\apps\testapp\bin\retxt.py

retxt.py

f1 = open('C:\\mytest\\1.txt', 'r')
f2 = open('C:\\mytest\\2.txt', 'w')
for line in f1:
    f2.write(line.replace('xyz', 'abcd'))
f1.close()
f2.close()

Note: If I run the Python script independently, it works fine. Can you please let me know where I am going wrong?
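
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the props.conf spec describes unarchive_cmd as a command that takes the monitored file on stdin and produces output on stdout, whereas retxt.py above opens fixed paths and writes to 2.txt, so nothing would reach Splunk. The spec also states that unarchive_cmd is only valid in [source::<source>] stanzas, not in sourcetype stanzas such as [sampledata], and the unquoted space in C:\Program Files may need attention. The same replacement reshaped as a filter might look like this (replace_lines is a hypothetical name):

```python
import sys


def replace_lines(src, dst, old="xyz", new="abcd"):
    """Copy src to dst line by line, replacing old with new."""
    for line in src:
        dst.write(line.replace(old, new))


if __name__ == "__main__":
    # Read the monitored file from stdin and emit the result on stdout,
    # instead of opening hard-coded input and output paths.
    replace_lines(sys.stdin, sys.stdout)
```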
