Wednesday, January 11, 2012

Perl, UTF-8, and TextEdit character encoding hell

I'm working on a project where I need to dynamically read in content from a flat file. The flat file contains some template information, and I'm also pasting in (using TextEdit) the contents of a file that was generated by a Perl script reading from a database.

The problem was that when I read the file into a string using stringWithContentsOfFile and NSUTF8StringEncoding, it would blow up. I could use NSASCIIStringEncoding, but then some of the characters (e.g. em dashes, single and double quotes) were translated incorrectly. If I brought the file up in TextEdit or Dashcode, everything looked great. Displaying it in vi or at the command line did not.
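To illustrate why the fallback encoding mangles punctuation, here is a minimal Python sketch (not the Cocoa code from the project). It assumes the single-byte fallback maps each byte to one character the way Latin-1 does, which I have not verified for NSASCIIStringEncoding; decode_as is a hypothetical helper name.

```python
def decode_as(data: bytes, encoding: str) -> str:
    """Decode raw bytes with the given codec (hypothetical helper)."""
    return data.decode(encoding)


# An em dash is one character but three bytes in UTF-8.
em_dash_bytes = "\u2014".encode("utf-8")

# Assumption: the single-byte fallback turns every byte into its own
# character, as Latin-1 does. One em dash becomes three junk characters.
garbled = decode_as(em_dash_bytes, "latin-1")

# The bytes are untouched, so re-encoding and decoding as UTF-8
# gets the original character back.
recovered = garbled.encode("latin-1").decode("utf-8")
```

The point of the sketch is that the file's bytes can be perfectly fine while the decoded string looks wrong, so the thing to check is which encoding each tool assumes.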

When I ran file -I foo.txt, it reported the file type as "unknown", even though the Perl-generated file was UTF-8.
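A more direct check than file -I is to try a strict UTF-8 decode and see where it fails. This is a rough Python sketch under my own assumptions: first_invalid_utf8_offset and check_file are hypothetical names, and foo.txt stands in for whatever file is misbehaving.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks strict UTF-8
    decoding, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first offending byte.
        return err.start
    return None


def check_file(path: str) -> Optional[int]:
    """Read a file as raw bytes and run the same check on it."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())


# Valid UTF-8, including a real em dash, passes.
clean = first_invalid_utf8_offset("em \u2014 dash".encode("utf-8"))

# 0xD1 is how MacRoman encodes an em dash. Whether TextEdit actually
# wrote MacRoman in my case is an assumption; the byte is only here
# as an example of something that is not valid UTF-8 in this position.
broken = first_invalid_utf8_offset(b"abc\xd1def")
```

If check_file("foo.txt") returns a number, that offset is the place to look in a hex dump to see what was actually written.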

I traced the problem to the TextEdit "Plain Text File Encoding" preferences. Both "Opening Files" and "Saving Files" were set to UTF-8. That helped when reading the data, but somehow, when saving the pasted-in content, it caused the file's encoding to get hosed such that certain tools (e.g. my command window, which is set to UTF-8) could no longer read the characters properly.

Once I set the preferences back to Automatic, the saved file was UTF-8, but now the special characters don't display correctly in TextEdit. The file does load into an NSString with NSUTF8StringEncoding, though.

Weird.

That's three hours of my life I'll never get back.