I have a data file that was converted from RTF to TXT. When I started trying to parse it with Perl, my regular expressions couldn't split lines that looked like they had whitespace delimiters. They just ignored the whitespace.
After my initial confusion, I figured that the whitespace must be something other than an ASCII space, tab, etc. With some experimentation, I noticed that each piece of "whitespace" was actually made up of several bytes.
To figure out what the bytes/characters were, I wrote a little Perl code segment that looked like:
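# Find each run of characters outside the expected set and print its bytes in hex.
while ($filecontents =~ /([^\w\s.:;&,\-()]+)/) {
    my $run = $1;
    (my $hex = $run) =~ s/(.)/sprintf("%x ", ord($1))/eg;
    print "f is $hex\n";
    # Replace every occurrence so the loop moves on to the next distinct run.
    # \Q...\E keeps any regex metacharacters in the run from being interpreted.
    $filecontents =~ s/\Q$run\E/zzz/g;
}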
Basically, the code goes through the file, finds oddball characters and prints them out as hex bytes. When I ran it, it produced the following:
f is e2 80 83
f is e2 80 a8
f is e2 81 84
f is c2 b0
Note that each of those looks like a multi-byte character, but what are they?
Well, I do love the internet. I cut and pasted e2 80 83 into Google and found that it was an "em space", aka Unicode character U+2003.
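The same lookup can probably be done in Perl itself instead of a search engine. This is a minimal sketch, assuming the bytes really are UTF-8; `describe_bytes` is a hypothetical helper name of mine, and it leans on the core `Encode` and `charnames` modules to turn a hex dump into a code point and its official name.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical helper: take a hex dump such as "e2 80 83" and return one
# "U+XXXX NAME" string per decoded character.
sub describe_bytes {
    my ($hex) = @_;
    my $bytes = pack('C*', map { hex($_) } split(' ', $hex));
    # FB_CROAK makes decode die on malformed UTF-8 instead of guessing.
    my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);
    return map {
        sprintf('U+%04X %s', ord($_), charnames::viacode(ord($_)) // 'UNKNOWN')
    } split(//, $chars);
}

# The four byte sequences from the output above.
for my $dump ('e2 80 83', 'e2 80 a8', 'e2 81 84', 'c2 b0') {
    print "$dump => $_\n" for describe_bytes($dump);
}
```

Run against the four sequences, this should report an em space, a line separator, a fraction slash and a degree sign.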
Once I knew which Unicode character it was, I could just replace all of the em spaces with a regular space, and the rest of my program worked as designed. Same idea with the other special characters. Two of them were not whitespace at all, just other non-ASCII characters: e2 81 84 is the fraction slash (U+2044) and c2 b0 is the degree symbol (U+00B0).
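Handling all four sequences in one pass could look something like the sketch below. It works on raw bytes, the way the rest of this post does. The `%replacement` table and the `asciify` name are my own illustration, and the ASCII stand-ins (especially " deg" for the degree sign) are arbitrary choices to adjust for your data.

```perl
use strict;
use warnings;

# UTF-8 byte sequences found in the file, mapped to ASCII stand-ins.
# The stand-ins are assumptions; pick whatever suits the downstream parser.
my %replacement = (
    "\xe2\x80\x83" => ' ',      # em space (U+2003)
    "\xe2\x80\xa8" => "\n",     # line separator (U+2028)
    "\xe2\x81\x84" => '/',      # fraction slash (U+2044)
    "\xc2\xb0"     => ' deg',   # degree sign (U+00B0)
);

# Hypothetical helper: replace every known byte sequence in one substitution.
sub asciify {
    my ($text) = @_;
    my $pattern = join('|', map { quotemeta($_) } keys %replacement);
    $text =~ s/($pattern)/$replacement{$1}/g;
    return $text;
}

print asciify("72\xc2\xb0\xe2\x80\x831\xe2\x81\x842"), "\n";
```

Keeping the mapping in one hash means a newly discovered oddball sequence only needs one extra line.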
Note that, at least in my case, I had to match on the raw hex bytes rather than on the Unicode code point. In other words:
$filecontents =~ s/\xe2\x80\x83/ /g;
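My first guess was that the Unicode escape refers to a UTF-16 character while my file is UTF-8. The more likely explanation is that I read the file as raw bytes, so Perl saw three separate bytes rather than one decoded character, and a code point escape like \x{2003} can only match once the input has been decoded as UTF-8. For next time, I should try decoding the file when I read it (for example with an :encoding(UTF-8) layer on open). Maybe it would be easier :)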
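Here is a small sketch of that difference, assuming the data is valid UTF-8. The `$raw` string is a made-up stand-in for the file contents, and `decode` comes from the core `Encode` module. Before decoding, the em space is three bytes and `\x{2003}` finds nothing; after decoding it is one character and the code point match works.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for raw file contents: "a", an em space as UTF-8 bytes, "b".
my $raw = "a\xe2\x80\x83b";

# On the undecoded bytes, the code point escape has nothing to match.
my $matched_raw = ($raw =~ /\x{2003}/) ? 1 : 0;

# After decoding, the three bytes collapse into the single character U+2003.
my $text = decode('UTF-8', $raw);
(my $clean = $text) =~ s/\x{2003}/ /g;

printf "raw length %d, decoded length %d, matched before decoding: %d\n",
    length($raw), length($text), $matched_raw;
print "$clean\n";
```

Opening the file with a layer such as `open(my $fh, '<:encoding(UTF-8)', $path)` should have the same effect as the explicit `decode` call, with the decoding done as the file is read.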