Easy Programming: Python encoding and decoding

Monday, December 4, 2017

Python encoding and decoding

1. Unicode and UTF-8

Unicode is a character set. UTF-8 is one way of encoding Unicode; other encodings include UTF-16 and UTF-32. We can find the Unicode code point of a character by looking it up in the standard. The UTF-8 encoding can then be calculated from the code point.
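As a rough illustration of how the UTF-8 bytes follow from the code point, here is a minimal sketch. The function name utf8_bytes is hypothetical, and the sketch skips validation (for example, it does not reject surrogate code points).

```python
def utf8_bytes(code_point):
    """Return the UTF-8 encoding of a code point as a list of integers."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return [code_point]
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]

# U+54A9 is the character used in the examples later in this post.
encoded = utf8_bytes(0x54A9)  # [0xE5, 0x92, 0xA9]
```

The three-byte branch is the one that applies to the Chinese characters used in this post.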

2. str and unicode in python 2

str and unicode are two distinct types.

str stores the bytes produced by encoding. When displayed, every non-ASCII byte is shown as two hexadecimal digits prefixed with \x. In UTF-8, a common Chinese character takes three bytes.

A str can be converted to unicode with the decode() method.

example:

a.decode('utf-8')

A unicode can be converted to str with the encode() method.

example:

b.encode('utf-8')
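A minimal round trip, assuming a and b hold the same character used in the later examples. The b'...' and u'...' prefixes are used so that the sketch runs on both Python 2.7 and Python 3 (where Python 2's str corresponds to bytes and unicode to str).

```python
a = b'\xe5\x92\xa9'            # encoded bytes (a Python 2 str)
b = a.decode('utf-8')          # bytes -> unicode
round_trip = b.encode('utf-8') # unicode -> bytes

# One character, but three bytes once encoded as UTF-8.
byte_count = len(a)
char_count = len(b)
```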

ENCODE METHODS

1. head of a script

#-*- coding: utf-8 -*-

or

# coding=utf-8

If the script does not declare an encoding, Python 2 assumes ASCII, and any non-ASCII character in the source file raises a SyntaxError.

2. sys.stdin.encoding and sys.stdout.encoding

3. sys.getdefaultencoding()
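The runtime settings in items 2 and 3 can be inspected with a short sketch like the following. The dictionary name settings is only for illustration, and getattr is used because a stream may lack an encoding attribute (or report None) when it is redirected.

```python
import sys

settings = {
    # Encoding the interpreter associates with standard input, if any.
    'stdin': getattr(sys.stdin, 'encoding', None),
    # Encoding used when printing unicode to standard output, if any.
    'stdout': getattr(sys.stdout, 'encoding', None),
    # Codec used for implicit str <-> unicode conversions.
    'default': sys.getdefaultencoding(),
}

for name in sorted(settings):
    print('%s: %s' % (name, settings[name]))
```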

CHARACTER SET CONCATENATION

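When a str and a unicode are concatenated, the result is unicode. Python first decodes the str to unicode implicitly, using the default encoding (ASCII), and then concatenates. If the str contains non-ASCII bytes, this implicit decoding raises a UnicodeDecodeError, so it is safer to decode the str explicitly with decode() first.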
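The following sketch imitates that behaviour. Because Python 3 refuses to mix bytes and text at all, the implicit step is written out as an explicit decode('ascii'), which is an assumption about what Python 2 does behind the scenes when the default encoding is ASCII.

```python
s = b'\xe5\x92\xa9'   # UTF-8 bytes (a Python 2 str)
u = u'hello '

# Imitate the implicit ASCII decoding of u + s in Python 2.
try:
    implicit = u + s.decode('ascii')
except UnicodeDecodeError:
    implicit = None   # non-ASCII bytes cannot be decoded as ASCII

# Decoding explicitly with the right codec works.
explicit = u + s.decode('utf-8')
```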

READ IN FILE AND JSON

Reading a file gives a str, whose non-ASCII bytes are shown as hexadecimal digits prefixed with "\x".

f=open('t.txt')
a=f.read()

a:

'{"hello":"\xe5\x92\xa9"}\n'

It can be decoded to unicode with the json module:

import json
json.loads(a)

{u'hello':u'\u54a9'}
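The same step as a self-contained sketch, with the file contents inlined as a bytes literal instead of being read from t.txt. The explicit decode('utf-8') is an addition so that the code behaves the same on Python 2.7 and Python 3.

```python
import json

# The bytes that f.read() returned in the example above.
raw = b'{"hello":"\xe5\x92\xa9"}\n'

# Decode the bytes first, then parse; keys and values come back as text.
data = json.loads(raw.decode('utf-8'))
value = data[u'hello']
```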

OUTPUT

A str can be written to a file directly. A unicode needs to be encoded to str with encode() first.
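A minimal sketch of writing encoded text, assuming UTF-8 as the target encoding. The temporary directory and the file name t.txt are placeholders chosen for the example.

```python
import io
import os
import tempfile

text = u'\u54a9'
path = os.path.join(tempfile.mkdtemp(), 't.txt')

# Encode to bytes, then write in binary mode.
with io.open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Reading the file back in binary mode returns the encoded bytes.
with io.open(path, 'rb') as f:
    written = f.read()
```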

COMPUTE md5

MD5 operates on bytes, so unicode must be encoded to str first.

hashlib.md5(a).hexdigest()
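For a unicode value, the encoding step comes first, as in this sketch. UTF-8 is assumed here; a different encoding would produce different bytes and therefore a different digest.

```python
import hashlib

text = u'\u54a9'

# Encode to bytes before hashing.
digest = hashlib.md5(text.encode('utf-8')).hexdigest()
```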

OUTPUT TO STDOUT

When writing to stdout, the encoding used is sys.stdout.encoding, which depends on the system's locale settings.

import sys
sys.stdout.encoding

'UTF-8'

In a zh_CN.GB2312 environment the default is not UTF-8, so UTF-8 encoded output does not display correctly.
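One way to cope with a differing stdout encoding is to encode explicitly and replace characters the target encoding cannot represent. This is only a sketch: the helper name encode_for_output and the UTF-8 fallback are assumptions, not something the interpreter does on its own.

```python
import sys

def encode_for_output(text, encoding=None):
    """Encode text for a stream, replacing unencodable characters with '?'."""
    if encoding is None:
        # Fall back to UTF-8 when stdout reports no encoding (e.g. a pipe).
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, 'replace')

as_utf8 = encode_for_output(u'\u54a9', 'utf-8')
as_ascii = encode_for_output(u'\u54a9', 'ascii')  # b'?'
```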

COMMAND PARAMETER READIN

The arguments obtained through sys.argv and argparse are all of type str, with non-ASCII bytes shown as hexadecimal digits prefixed with \x. We can get the encoding from sys.stdin.encoding and then convert them to unicode.

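#!/usr/bin/env python
# coding=utf-8

import sys

# the raw argument: a str holding encoded bytes
print repr(sys.argv[1])

# the encoding of the terminal
print sys.stdin.encoding

# the argument decoded to unicode
print repr(sys.argv[1].decode(sys.stdin.encoding))

python hello.py "哇嘿嘿"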

'\xe5\x93\x87\xe5\x98\xbf\xe5\x98\xbf'
UTF-8
u'\u54c7\u563f\u563f'
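The decoding step can be wrapped in a small helper, sketched below. The name decode_argument is hypothetical, and the isinstance check is there because on Python 3 sys.argv already contains text, so only byte strings need decoding.

```python
def decode_argument(raw, encoding='utf-8'):
    """Return raw as text, decoding it only if it is a byte string."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw

# The bytes printed by the script above for its first argument.
raw_arg = b'\xe5\x93\x87\xe5\x98\xbf\xe5\x98\xbf'
decoded = decode_argument(raw_arg, 'utf-8')
```

In a real script the encoding argument would come from sys.stdin.encoding, as in the example above.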

CONVERTING A STRING STARTING WITH \u TO UNICODE

b='\u54a9'

b

'\\u54a9'

There are two ways to convert b to the Chinese character it represents:

1. unicode-escape

unicode(b,'unicode-escape')

u'\u54a9'

or

b.decode('unicode-escape')

u'\u54a9'

2. eval concatenation

eval('u"'+b.replace('"',r'\"')+'"')

u'\u54a9'
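Both routes can be sketched so that they run on Python 2.7 and Python 3. Note one swap: ast.literal_eval is used here in place of eval, since it only accepts literals and is therefore a safer choice for untrusted input. The variable names are placeholders.

```python
import ast

# Route 1: decode the escaped bytes with the unicode-escape codec.
escaped_bytes = b'\\u54a9'   # backslash, 'u', then four hex digits
via_codec = escaped_bytes.decode('unicode-escape')

# Route 2: build a unicode literal and evaluate it as a literal only.
b = '\\u54a9'
literal = 'u"' + b.replace('"', r'\"') + '"'
via_literal = ast.literal_eval(literal)
```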
