Here's a pattern that I've been using a lot to read files in Python 2.7:
import codecs

badrow_count = 0
goodrow_count = 0
rowcount = 0
filename = "somefile.txt"
encoding = "utf-8"
# see https://docs.python.org/2/library/codecs.html#standard-encodings
# for encodings
decode_error_handler = 'strict'
# this is the default
# see https://docs.python.org/2/library/codecs.html#codec-base-classes
# for decoding error callbacks
f = codecs.open(filename=filename, mode='rU', encoding=encoding,
                errors=decode_error_handler)
eof = False
while not eof:
    row = u''
    try:
        row = f.next()
    except UnicodeDecodeError as e:
        badrow_count += 1
        # do other things on this row
    except StopIteration:
        eof = True
    #except Exception as e:
    #    handle other issues
    else:
        goodrow_count += 1
        # do other stuff with row
    finally:
        # count every row attempted, good or bad, until end of file
        if not eof:
            rowcount += 1
        else:
            break
I prefer this to:
for row in f:
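primarily in order to catch Unicode decoding errors row by row. With a for loop, the exception is raised by the iteration itself rather than inside the loop body, so it cannot be handled for a single row without ending the loop.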
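
For comparison, here is a minimal sketch of a different technique (not the pattern above) that keeps the for loop: read the file as bytes and decode each line yourself, so the decoding error is raised inside the loop body where it can be caught. The function name count_rows and the in-memory sample data are hypothetical, and the sketch assumes an encoding such as UTF-8 in which a newline byte always marks the end of a line; it would not be appropriate for UTF-16.

```python
import io


def count_rows(stream, encoding="utf-8"):
    """Tally rows of a binary stream that do and do not decode cleanly."""
    good, bad = 0, 0
    for raw in stream:  # iterating over bytes never raises a decode error
        try:
            row = raw.decode(encoding)  # 'strict' error handling by default
        except UnicodeDecodeError:
            bad += 1
            # do other things on this row
        else:
            good += 1
            # do other stuff with row
    return good, bad


# Hypothetical sample: the second line starts with a byte that is invalid UTF-8.
sample = io.BytesIO(b"first\n\xff\xfe broken\nthird\n")
print(count_rows(sample))  # (2, 1)
```

For a real file, something like open(filename, 'rb') could stand in for the BytesIO object. The trade-off is that you lose the universal-newline handling and do the decoding by hand.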