I want to compare two big files (200.000.000 and 150.000.000 lines). Both are lists of domain names, and I want to produce the list of differences. The first file is an export from Splunk. Example:
tmpzonefile: twotwotwo.nl two.nl three.nl four.nl five.nl
tmpingestedzonefile: twotwo.nl three.nl four.nl
The diff file must be: twotwo.nl five.nl
The following script writes too many domains to the diff file. Any idea what goes wrong here? It also takes forever to process large files.
if debug == 1:
    print('DEBUG: Number of ingested domains returned: %s' % str(count))
    print('DEBUG: Missing domains: %s' % str(numdomains-count))
# Determine missing domains
tmpzonefile_f = open(tmpzonefile)
tmpingestedzonefile_f = open(tmpingestedzonefile)
difffile = open('/tmp/'+zone+'_zone_full.txt', 'wt')
old = [line.strip() for line in tmpzonefile_f]
new = [line.strip() for line in tmpingestedzonefile_f]
count = 0
for line in old:
    if line not in new:
        count += 1
        difffile.write(line+'\n')
print('DEBUG: Number of domain written to difffile file: %s' % str(count))
tmpzonefile_f.close()
tmpingestedzonefile_f.close()
difffile.close()
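For reference, the slowness most likely comes from `line not in new`: `new` is a list, so every lookup scans it from the start. With files this size that is on the order of 200.000.000 x 150.000.000 comparisons. A minimal sketch of a set-based version follows. The function name `write_missing_domains` and its parameters are hypothetical. Lowercasing each line and skipping blank lines are assumptions about what might make the two files disagree (for example case differences in the Splunk export); they are not confirmed causes of the extra output.

```python
def write_missing_domains(zone_path, ingested_path, diff_path):
    """Write every domain in zone_path that is absent from ingested_path.

    Returns the number of domains written. Hypothetical helper, not part
    of the original script.
    """
    # Build a set of ingested domains: membership tests on a set take
    # constant time on average, instead of scanning a whole list.
    # Assumption: domains are compared case-insensitively, blanks ignored.
    with open(ingested_path) as ingested_f:
        ingested = {line.strip().lower() for line in ingested_f if line.strip()}

    count = 0
    # Stream the zone file line by line so it is never held in memory.
    with open(zone_path) as zone_f, open(diff_path, 'wt') as diff_f:
        for line in zone_f:
            domain = line.strip().lower()
            if domain and domain not in ingested:
                count += 1
                diff_f.write(domain + '\n')
    return count
```

With the variables from the script above, this could be called as `count = write_missing_domains(tmpzonefile, tmpingestedzonefile, '/tmp/'+zone+'_zone_full.txt')`. Only the ingested file is loaded into memory, but a set of 150.000.000 strings still needs many gigabytes of RAM; if that does not fit, sorting both files and comparing them in a single merge pass is an alternative.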