CS204 Wednesday, 25 April 2012

No 3A office hours today. (Can answer questions on
the way back to CMC, though.)

http://vim-adventures.com/

I. Encoding assignment advice

-- This program cannot be done perfectly. Sometimes
even TextWrangler and similar programs guess the
encoding wrong.

-- Don't forget that I asked you to state the
assumptions your program makes. You could make
extremely restrictive assumptions (e.g. the
file consists of only ASCII characters).

-- Try decoding from utf-8.

-- Didn't work? Take a look at the first two bytes
to see if they are a BOM.

-- Didn't work? Try decoding from utf-16-le.
How can you tell (or guess) whether it worked?
You could try examining the first few characters
in the decoded unicode string. If most of their
codepoints are under 256, then you probably started
with a Latin alphabet text file. If not...what
should you do?

   # myString holds the raw bytes read from the file
   s = myString.decode('utf-16-le')
   latinCount = 0
   for ch in s:
      if ord(ch) < 256:
         latinCount += 1   # that's a latin-ish character

-- etc.
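
Here is one way the utf-8 and utf-16-le steps above
might fit together. It's only a sketch, assuming
Python 3 and raw bytes as input. The names
looks_latin and guess_encoding, the 20-character
sample, and the 80% cutoff are all made up for
illustration, not part of the assignment.

```python
def looks_latin(text, sample_size=20, threshold=0.8):
    """Return True if most of the first few characters have codepoints under 256.

    sample_size and threshold are arbitrary illustrative choices.
    """
    sample = text[:sample_size]
    if not sample:
        return False
    latin = sum(1 for ch in sample if ord(ch) < 256)
    return latin >= threshold * len(sample)


def guess_encoding(raw):
    """Guess between utf-8 and utf-16-le; return None if neither looks right."""
    # Step 1: strict utf-8 decoding fails loudly on most non-utf-8 input.
    try:
        raw.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass

    # Step 2: try utf-16-le and check whether the result looks Latin-ish.
    try:
        text = raw.decode('utf-16-le')
    except UnicodeDecodeError:
        return None
    return 'utf-16-le' if looks_latin(text) else None
```

One caveat worth stating as an assumption: a
utf-16-le file with no BOM that contains only ASCII
characters is also valid utf-8 (every other byte is
0x00), so this sketch would report it as utf-8.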

BOM (byte order mark) = the character whose
codepoint is 0xFEFF, placed at the very beginning
of a file.

If the file looks like this:
   0xfe 0xff -- that's a UTF-16 file, big endian

(build a 16-bit integer out of these two bytes in
a big-endian way: 0xfeff)

If the file looks like this:
   0xff 0xfe -- that's a UTF-16 file, little endian
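
The two BOM cases above could be checked like this.
Again just a sketch, assuming Python 3 and raw bytes.
The function names are invented for illustration;
codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE are
constants from the standard library. It ignores
other encodings that start with the same bytes
(a UTF-32 little-endian BOM also begins 0xff 0xfe).

```python
import codecs
import struct


def first_word_big_endian(raw):
    """Build a 16-bit integer out of the first two bytes, big-endian."""
    return struct.unpack('>H', raw[:2])[0]


def utf16_byte_order(raw):
    """Return the codec name implied by a UTF-16 BOM, or None if there is none."""
    if raw.startswith(codecs.BOM_UTF16_BE):   # bytes 0xfe 0xff
        return 'utf-16-be'
    if raw.startswith(codecs.BOM_UTF16_LE):   # bytes 0xff 0xfe
        return 'utf-16-le'
    return None
```

If utf16_byte_order(raw) returns a codec name, you
could then decode raw[2:] with it, skipping the two
BOM bytes.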


II. Walk through XML sample.

I like lxml better than minidom, but it's
not part of Python's standard library (you
have to install it separately).