
Nic has posted 15 posts at DZone.

Convert cp1252 -> UTF-8 Character Set (Python and Ruby)

04.16.2008
Oooh, I hate character sets. Specifically, that there is more than one of them. Here is a Ruby version of a Python script I found to convert cp1252 (aka windows-1252) into UTF-8.

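  # Written for Ruby 1.8, where String#[] with an integer index returns the byte value.
  def clean_up dirty_text
    newstr = ""
    dirty_text.length.times do |i|
      character = dirty_text[i]
      newstr += if character < 0x80
        character.chr                    # ASCII: copied through unchanged
      elsif character < 0xC0
        "\xC2" + character.chr           # 0x80..0xBF: lead byte 0xC2, same byte follows
      else
        "\xC3" + (character - 64).chr    # 0xC0..0xFF: lead byte 0xC3, then byte - 0x40
      end
    end
    newstr
  end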

The original Python script (http://miscoranda.com/96) was:

#!/usr/bin/python
# Python 2 script: sys.stdin.read() returns a byte string, so c is a single byte.
import sys
for c in sys.stdin.read():
   if ord(c) < 0x80: sys.stdout.write(c)
   elif ord(c) < 0xC0: sys.stdout.write('\xC2' + c)
   else: sys.stdout.write('\xC3' + chr(ord(c) - 64))
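
One caveat worth spelling out: the byte arithmetic in both scripts maps byte 0xNN straight to code point U+00NN, which matches ISO-8859-1 exactly. cp1252 assigns printable characters to most of 0x80..0x9F (0x80 is the euro sign, U+20AC, for instance), so those bytes would not come out as intended. A minimal sketch for Ruby 1.9 or later, leaning on the built-in transcoder instead; the name cp1252_to_utf8 is made up for illustration, and bytes that cp1252 leaves unassigned may raise a conversion error here:

```ruby
# Sketch for Ruby 1.9+; cp1252_to_utf8 is a hypothetical helper name.
def cp1252_to_utf8(dirty_text)
  # Relabel a copy of the raw bytes as Windows-1252, then transcode to UTF-8.
  dirty_text.dup.force_encoding("Windows-1252").encode("UTF-8")
end
```

The dup keeps the caller's string untouched, since force_encoding changes the encoding label in place.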
    

Comments

Simon Ask Ulsnes replied on Wed, 2007/11/21 - 6:23am

When using Ruby 1.9, it's important to avoid indexing characters in a string using the [] operator. When possible, one should always use each_char and each_byte. The reason: Ruby 1.9 is encoding-neutral (all strings are stored internally in their original encoding), and some character sets are multibyte (such as UTF-8), so looking up a single character by index is an O(n) operation. If you loop through the characters of a string as this snippet does, you end up with a run time complexity of O(n^2). Looping with each_char or each_byte instead gives you O(n) run time. - Simon
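
For illustration, here is roughly what the snippet's loop could look like with each_byte. This is only a sketch for Ruby 1.9 or later, not the original author's code; the name clean_up_bytes is hypothetical. It keeps the same byte arithmetic, so it has the same ISO-8859-1 behaviour as the original.

```ruby
# Sketch for Ruby 1.9+; clean_up_bytes is a hypothetical name.
def clean_up_bytes(dirty_text)
  out = String.new                   # empty binary (ASCII-8BIT) buffer
  dirty_text.each_byte do |byte|     # yields each byte as an Integer, one pass
    if byte < 0x80
      out << byte                    # ASCII: copied through unchanged
    elsif byte < 0xC0
      out << 0xC2 << byte            # 0x80..0xBF: lead byte 0xC2, same byte follows
    else
      out << 0xC3 << (byte - 0x40)   # 0xC0..0xFF: lead byte 0xC3, then byte - 0x40
    end
  end
  out.force_encoding("UTF-8")        # the collected bytes are valid UTF-8
end
```

Appending Integers to a binary buffer sidesteps the encoding-compatibility checks that concatenating "\xC2" with a chr result can trigger on 1.9 and later.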

Nic Williams replied on Thu, 2006/04/20 - 8:32am

Re-found the Python source URL: http://miscoranda.com/96

Nic Williams replied on Thu, 2006/04/20 - 8:32am

@peter - This is why I post snippets, so I can learn stuff! Thx, definitely didn't know that.

Snippets Manager replied on Mon, 2012/05/07 - 2:57pm

And, yes, I do randomly run benchmarks on equivalent bits of code just to see what's marginally faster in Ruby. A very sad but intriguing hobby!

Snippets Manager replied on Mon, 2012/05/07 - 2:57pm

Did you do:

  dirty_text.length.times do |i|
    character = dirty_text[i]

for Ruby 1.8 & Ruby 1.9 cross-compatibility? If not,

  dirty_text.each_byte do |i|

takes 60% less time. (That said, in this case I think you'd maybe always want to go per byte rather than character due to the conversion.)
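
Anyone wanting to repeat that kind of comparison could start from something like the following, using the standard Benchmark module. It is a rough sketch, not the benchmark behind the 60% figure: the input data is invented, and the indexed loop uses getbyte (Ruby 1.9+) because [] no longer returns an Integer there. Timings will vary by Ruby version and machine.

```ruby
require "benchmark"

# Hypothetical input: about 200 KB of bytes cycling through 0x00..0xFF.
data = (0..255).to_a.pack("C*") * 800

# Index-based loop; getbyte(i) returns the byte at position i as an Integer.
by_index = lambda do
  sum = 0
  data.length.times { |i| sum += data.getbyte(i) }
  sum
end

# Iterator-based loop, as suggested in the comment.
by_each_byte = lambda do
  sum = 0
  data.each_byte { |b| sum += b }
  sum
end

Benchmark.bm(10) do |x|
  x.report("index")     { by_index.call }
  x.report("each_byte") { by_each_byte.call }
end
```

Both lambdas return the same byte sum, which is a cheap sanity check that the two loops visit the same data.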