Unicode ‘funny characters’

Some characters cause strange behaviour when you try to print them to the console.

It seems to depend on the environment and on how Python was built. I tested it on Windows Vista and it failed; then it worked on some *nix machines and failed on others. I used Python 2.6.x. So it’s possible you won’t be able to reproduce it!

An example

The Russian character ‘ы’ has code point U+044B; the symbol ‘©’ has code point U+00A9.

    >>> a = u'ы'
    >>> a
    u'\u044b'
    >>> b = u'\u00a9'
    >>> b
    u'\xa9'
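
The code points quoted above can be checked directly with the built-in `ord`. This is a small sketch; the helper name `code_point` is made up for illustration and is not part of the original session.

```python
def code_point(ch):
    """Return the 'U+XXXX' notation for a single unicode character."""
    return 'U+%04X' % ord(ch)

print(code_point(u'\u044b'))  # U+044B
print(code_point(u'\xa9'))    # U+00A9
```

The short `\xa9` form in the repr is simply how Python 2 displays code points below U+0100; it is the same character as `\u00a9`.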

Trying to print:

    >>> print a
    ы
    >>> print b
    ...
    UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 0: character maps to <undefined>
    >>>
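
The `'charmap'` in the error message suggests that `print` was encoding the text for a legacy console code page. Here is a minimal sketch of how to check that, assuming the console used cp866 (common for Russian-language Windows consoles; the session above does not say which encoding was actually in effect). The helper `can_encode` is hypothetical.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# cp866 is an assumption here; substitute whatever sys.stdout.encoding reports.
print(can_encode(u'\u044b', 'cp866'))  # the Cyrillic letter is in the code page
print(can_encode(u'\xa9', 'cp866'))    # the copyright sign is not
print(can_encode(u'\xa9', 'utf-8'))    # UTF-8 covers every code point
```

On a real machine, `sys.stdout.encoding` shows which codec `print` will use, which would explain why the same code behaves differently from one environment to the next.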

Now try writing to a file:

    >>> f = open('test.txt', 'w')
    >>> f.write((a + b).encode('utf-8'))
    >>> f.close()

That works fine, because the text is explicitly encoded as UTF-8 before it is written.
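
As an alternative to calling `.encode('utf-8')` by hand, `io.open` (available since Python 2.6) accepts an `encoding` argument and does the conversion itself. This is only a sketch: the helpers `write_text` and `read_text` and the temporary path are invented for the example.

```python
import io
import os
import tempfile

def write_text(path, text):
    """Write unicode text to path, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def read_text(path):
    """Read a UTF-8 encoded file back as unicode text."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()

# Use a throwaway directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
write_text(path, u'\u044b\xa9')
round_trip = read_text(path)
```

Reading the file back gives the same two characters, so nothing is lost on the way to disk; only the console is the weak point.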

What to do

Work around it.

I found a suggestion to remove such characters from the text: http://code.activestate.com/recipes/546517-accent2htmlcodepy-convert-accents-and-special-char/

It may not be the best solution, but it may be necessary if you want to print text with ‘funny characters’ to the console.

    # Characters that are replaced with a space before printing.
    _spec_chars = [
        u'\xc1', u'\xe1', u'\xc0', u'\xc2', u'\xe0', u'\xc2', u'\xe2', u'\xc4',
        u'\xe4', u'\xc3', u'\xe3', u'\xc5', u'\xe5', u'\xc6', u'\xe6', u'\xc7',
        u'\xe7', u'\xd0', u'\xf0', u'\xc9', u'\xe9', u'\xc8', u'\xe8', u'\xca',
        u'\xea', u'\xcb', u'\xeb', u'\xcd', u'\xed', u'\xcc', u'\xec', u'\xce',
        u'\xee', u'\xcf', u'\xef', u'\xd1', u'\xf1', u'\xd3', u'\xf3', u'\xd2',
        u'\xf2', u'\xd4', u'\xf4', u'\xd6', u'\xf6', u'\xd5', u'\xf5', u'\xd8',
        u'\xf8', u'\xdf', u'\xde', u'\xfe', u'\xda', u'\xfa', u'\xd9', u'\xf9',
        u'\xdb', u'\xfb', u'\xdc', u'\xfc', u'\xdd', u'\xfd', u'\xff', u'\xa9',
        u'\xae', u'\u2122', u'\u20ac', u'\xa2', u'\xa3', u'\u2018', u'\u2019',
        u'\u201c', u'\u201d', u'\xab', u'\xbb', u'\u2014', u'\u2013', u'\xb0',
        u'\xb1', u'\xbc', u'\xbd', u'\xbe', u'\xd7', u'\xf7', u'\u03b1',
        u'\u03b2', u'\u221e',
    ]

    def cleanspec(s, cleaned=_spec_chars):
        # Replace every listed character with a space; keep everything else.
        return u''.join(u' ' if c in cleaned else c for c in s)

Now print the cleaned text:

    >>> print cleanspec(b + a)
     ы

That is a workaround. Maybe this is an issue with Python’s print; I’m not quite sure about that.
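
A hand-maintained character list is not the only option. Encoding with the `'replace'` error handler substitutes `?` for anything the target codec cannot represent. The sketch below is not the recipe’s method: the helper name `console_safe` is made up, and cp866 is an assumed console encoding standing in for whatever `sys.stdout.encoding` reports.

```python
def console_safe(text, encoding):
    """Return text with characters missing from encoding replaced by '?'."""
    # Encode with 'replace', then decode again so the result is still unicode.
    return text.encode(encoding, 'replace').decode(encoding)

# cp866 is assumed for illustration only.
safe = console_safe(u'\xa9\u044b', 'cp866')
```

Here `safe` is `u'?\u044b'`: the copyright sign becomes a question mark and the Cyrillic letter survives, without having to list the problem characters in advance.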

That’s it :)

Tags: python unicode