CS204
Wednesday, 25 April 2012

No 3A office hours today. (Can answer questions on the way back to CMC, though.)

http://vim-adventures.com/

I. Encoding assignment advice

    -- This program cannot be done perfectly. Sometimes even TextWrangler
       and similar programs guess the encoding wrong.

    -- Don't forget that I asked you to state the assumptions your program
       makes. You could make extremely restrictive assumptions (e.g. the
       file consists of only ASCII characters).

    -- Try decoding from utf-8.

    -- Didn't work? Take a look at the first two bytes to see if they
       are a BOM.

    -- Didn't work? Try decoding from utf-16-le.

       How can you tell (or guess) whether it worked? You could try
       examining the first few characters in the decoded unicode string.
       If most of them have codepoints under 256, then you probably
       started with a Latin alphabet text file. If not...what should
       you do?

           s = myString.decode('utf-16-le')
           for ch in s:
               if ord(ch) < 256:
                   pass  # that's a latin-ish character

    -- etc.

    BOM = the character whose codepoint is 0xFEFF, placed at the very
    beginning of a file.

    If the file starts like this: 0xfe 0xff
        -- that's a UTF-16 file, big endian (build a 16-bit integer out
           of these two bytes in a big-endian way: 0xfeff)

    If the file starts like this: 0xff 0xfe
        -- that's a UTF-16 file, little endian

II. Walk through XML sample.

    I like lxml better than minidom, but lxml is not part of Python's
    standard library, so you have to install it separately.
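
Going back to the decoding steps in section I: here is one way they could fit together, as a rough sketch and not the required solution. It is written for Python 3, where the file contents are a bytes object. The names guess_decode and looks_latin, the 20-character sample, and the 80% threshold are all made up for illustration. The utf-16-be retry and the final latin-1 fallback are my own assumptions about one possible answer to "If not...what should you do?"; your program may reasonably do something else, as long as you state your assumptions.

```python
import codecs


def looks_latin(text, sample_size=20, threshold=0.8):
    """Return True if most of the first few characters have codepoints under 256.

    sample_size and threshold are arbitrary choices for this sketch.
    """
    sample = text[:sample_size]
    if not sample:
        return True
    latin = sum(1 for ch in sample if ord(ch) < 256)
    return latin / len(sample) >= threshold


def guess_decode(data):
    """Guess the encoding of a bytes object; return (text, encoding_name)."""
    # Step 1: try UTF-8 first.
    try:
        return data.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass

    # Step 2: look at the first two bytes to see if they are a BOM.
    # (decode() here still raises UnicodeDecodeError if the rest of the
    # file is not valid UTF-16.)
    if data[:2] == codecs.BOM_UTF16_BE:      # 0xfe 0xff
        return data[2:].decode('utf-16-be'), 'utf-16-be'
    if data[:2] == codecs.BOM_UTF16_LE:      # 0xff 0xfe
        return data[2:].decode('utf-16-le'), 'utf-16-le'

    # Step 3: no BOM. Try utf-16-le, then (assumption) utf-16-be, and keep
    # the first result whose leading characters look Latin-ish.
    for encoding in ('utf-16-le', 'utf-16-be'):
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        if looks_latin(text):
            return text, encoding

    # Last resort (assumption): latin-1 maps every byte to a codepoint,
    # so this decode never fails, but it may well be the wrong guess.
    return data.decode('latin-1'), 'latin-1'
```

Possible usage, assuming a file name of your own: open the file with open(filename, 'rb'), read() the bytes, and pass them to guess_decode. One known weak spot: a UTF-16 file with no BOM that contains only ASCII characters is also valid UTF-8 (the zero bytes decode as NUL characters), so step 1 accepts it and returns the wrong text. That is the kind of thing to cover when you state your assumptions.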