1. Unicode and UTF-8
Unicode is a character set. UTF-8 is one way of encoding Unicode; other encodings include UTF-16 and UTF-32. We can find the Unicode code point of a character by looking it up in the standard. The UTF-8 bytes can be calculated from the code point.
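As a rough illustration of how the UTF-8 bytes follow from the code point, the sketch below hand-encodes a code point in the three-byte range (U+0800 to U+FFFF) and compares it with the built-in encoder. The helper name utf8_three_byte is made up for this example, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def utf8_three_byte(code_point):
    """Hand-encode a code point in U+0800..U+FFFF as three UTF-8 bytes.

    Hypothetical helper for illustration only; it does not reject surrogates.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError('only the three-byte range is handled in this sketch')
    first = 0xE0 | (code_point >> 12)            # 1110xxxx: top 4 bits
    second = 0x80 | ((code_point >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    third = 0x80 | (code_point & 0x3F)           # 10xxxxxx: low 6 bits
    return bytes(bytearray([first, second, third]))

code_point = ord(u'\u54a9')
manual = utf8_three_byte(code_point)
builtin = u'\u54a9'.encode('utf-8')
print('%s %r %s' % (hex(code_point), manual, manual == builtin))
```

The three bytes it produces for U+54A9 are the same \xe5\x92\xa9 that appears in the file example later in this post.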
2. str and unicode in python 2
str and unicode are two different types.
str stores the bytes produced by encoding. When its repr is displayed, every non-ASCII byte is shown as two hexadecimal digits prefixed with \x. In UTF-8, most Chinese characters take three bytes.
A str can be converted to unicode with the decode() method.
Example:
a.decode('utf-8')
A unicode can be converted to str with the encode() method.
Example:
b.encode('utf-8')
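A minimal round trip, assuming the bytes are UTF-8. It uses b'' and u'' literals so it should behave the same on Python 2.7 (where b'' is a str) and on Python 3 (where it is bytes); the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-

raw = b'\xe5\x92\xa9'           # encoded bytes (a str in Python 2)
text = raw.decode('utf-8')      # bytes -> unicode text
back = text.encode('utf-8')     # unicode text -> bytes

print('%d bytes, %d character' % (len(raw), len(text)))

# Decoding with the wrong codec fails instead of silently guessing.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```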
ENCODING SETTINGS
1. Encoding declaration at the head of a script
#-*- coding: utf-8 -*-
or
# coding=utf-8
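If no encoding is declared, the default source encoding is ASCII, and any non-ASCII character in the script causes a SyntaxError.
2. sys.stdin.encoding and sys.stdout.encoding: the encodings used for standard input and standard output.
3. sys.getdefaultencoding(): the default encoding used for implicit conversion between str and unicode, which is 'ascii' in Python 2.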
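To see what these settings are on a given machine, something like the following can be run. report_encodings is a hypothetical helper name; getattr is used because stdin or stdout may be replaced by objects without an encoding attribute, in which case None is reported.

```python
import sys

def report_encodings():
    """Collect the encodings Python is currently using (illustrative helper)."""
    return {
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        'default': sys.getdefaultencoding(),
    }

encodings = report_encodings()
for name, value in sorted(encodings.items()):
    print('%s: %s' % (name, value))
```

The values depend on the interpreter version and the locale, so no particular output is assumed here.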
CHARACTER SET CONCATENATION
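When concatenating str and unicode, the result is unicode. Python first decodes the str to unicode implicitly, using the default encoding (ASCII), and then concatenates. If the str contains non-ASCII bytes, this implicit decoding raises UnicodeDecodeError.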
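One way to avoid the implicit ASCII decode is to decode explicitly before concatenating. The sketch below assumes the bytes are UTF-8; concat_text is a made-up helper name, and the isinstance check against bytes works on both Python 2.7 (where bytes is str) and Python 3.

```python
# -*- coding: utf-8 -*-

def concat_text(left, right, encoding='utf-8'):
    """Decode any byte-string operand explicitly, then concatenate as text."""
    if isinstance(left, bytes):
        left = left.decode(encoding)
    if isinstance(right, bytes):
        right = right.decode(encoding)
    return left + right

joined = concat_text(b'\xe5\x92\xa9', u'!')
print(repr(joined))

# What the implicit conversion effectively attempts, and why it fails:
try:
    b'\xe5\x92\xa9'.decode('ascii')
    implicit_would_work = True
except UnicodeDecodeError:
    implicit_would_work = False
```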
READING A FILE AND JSON
Reading a file gives a str, whose non-ASCII bytes are displayed as hexadecimal escapes starting with "\x".
f=open('t.txt')
a=f.read()
The value of a:
'{"hello":"\xe5\x92\xa9"}\n'
It can be parsed into unicode objects with the json module:
import json
json.loads(a)
{u'hello': u'\u54a9'}
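The same steps as a self-contained sketch. It writes the sample content to a temporary file rather than assuming t.txt exists, reads it back in binary mode, and decodes explicitly as UTF-8 (an assumption about the file's encoding) before parsing.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

payload = b'{"hello":"\xe5\x92\xa9"}\n'

# Temporary stand-in for the t.txt file used above.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'wb') as f:
        f.write(payload)
    with io.open(path, 'rb') as f:
        raw = f.read()              # bytes, as in the example above
finally:
    os.remove(path)

data = json.loads(raw.decode('utf-8'))   # keys and values are unicode text
print(repr(data))
```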
OUTPUT
A str can be written to a file directly. A unicode must first be encoded to str with encode().
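A short sketch of two ways to do this, assuming UTF-8 output and a temporary file as a stand-in for the real destination. The first encodes by hand; the second lets io.open do the encoding, which is an alternative to the manual encode() described above rather than a replacement for it.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u'\u54a9'

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Option 1: encode explicitly and write bytes.
    with io.open(path, 'wb') as f:
        f.write(text.encode('utf-8'))
    with io.open(path, 'rb') as f:
        written_manual = f.read()

    # Option 2: open in text mode and let io.open encode on write.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with io.open(path, 'rb') as f:
        written_auto = f.read()
finally:
    os.remove(path)
```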
COMPUTING md5
Computing an md5 hash requires bytes, so a unicode must be encoded to str first.
hashlib.md5(a).hexdigest()
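A small wrapper along these lines keeps the encoding step in one place. md5_hex is a hypothetical name, and UTF-8 is an assumed default; a different encoding gives a different hash for the same text.

```python
# -*- coding: utf-8 -*-
import hashlib

def md5_hex(value, encoding='utf-8'):
    """Return the md5 hex digest, encoding unicode text to bytes first."""
    if not isinstance(value, bytes):
        value = value.encode(encoding)
    return hashlib.md5(value).hexdigest()

digest_from_text = md5_hex(u'\u54a9')
digest_from_bytes = md5_hex(b'\xe5\x92\xa9')
print(digest_from_text == digest_from_bytes)
```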
OUTPUT TO STDOUT
When writing to stdout, the encoding used is sys.stdout.encoding, which depends on the system's locale settings.
import sys
sys.stdout.encoding
'UTF-8'
In a zh_CN.GB2312 environment the encoding is not UTF-8, so UTF-8 encoded output is not displayed correctly.
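One defensive pattern is to encode text for the terminal explicitly and replace characters the terminal encoding cannot represent. This is only a sketch: encode_for_terminal is a made-up helper, the UTF-8 fallback is an assumption for the case where sys.stdout.encoding is None, and u'\u4e2d' is just a sample character.

```python
# -*- coding: utf-8 -*-
import sys

def encode_for_terminal(text, encoding=None):
    """Encode text for output, replacing unencodable characters with '?'."""
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, 'replace')

as_utf8 = encode_for_terminal(u'\u4e2d', 'utf-8')
as_gb2312 = encode_for_terminal(u'\u4e2d', 'gb2312')
as_ascii = encode_for_terminal(u'\u4e2d', 'ascii')
print('%r %r %r' % (as_utf8, as_gb2312, as_ascii))
```

The same character becomes different bytes under UTF-8 and GB2312, which is why bytes encoded for one terminal look garbled on the other.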
READING COMMAND-LINE PARAMETERS
The parameters obtained from sys.argv and argparse are all str, with non-ASCII bytes displayed as hexadecimal escapes starting with \x. We can get the terminal encoding from sys.stdin.encoding and use it to convert them to unicode.
#!/usr/bin/env python
# coding=utf-8
import sys
print repr(sys.argv[1])
print sys.stdin.encoding
print repr(sys.argv[1].decode(sys.stdin.encoding))
python hello.py "哇嘿嘿"
'\xe5\x93\x87\xe5\x98\xbf\xe5\x98\xbf'
UTF-8
u'\u54c7\u563f\u563f'
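The decoding step can be wrapped in a helper such as the one below. decode_argument is a hypothetical name; the fallback to UTF-8 when sys.stdin.encoding is None (for example when input is piped) is an assumption, and on Python 3 the arguments already arrive as text, so they are passed through unchanged.

```python
# -*- coding: utf-8 -*-
import sys

def decode_argument(raw, encoding=None):
    """Turn a command-line argument into unicode text."""
    if not isinstance(raw, bytes):
        return raw                      # already text (Python 3)
    encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)

sample = b'\xe5\x93\x87\xe5\x98\xbf\xe5\x98\xbf'   # bytes from the run above
decoded = decode_argument(sample, 'utf-8')
print(repr(decoded))
```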
CONVERTING A STRING STARTING WITH \u TO UNICODE
b='\u54a9'
b
'\\u54a9'
To convert b to the Chinese character:
1. unicode-escape
unicode(b,'unicode-escape')
u'\u54a9'
or
b.decode('unicode-escape')
u'\u54a9'
2. eval with string concatenation
eval('u"'+b.replace('"',r'\"')+'"')
u'\u54a9'
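Both approaches can be sketched in a form that should run on Python 2.7 and Python 3. Note that the second part swaps in ast.literal_eval for eval: it is not the method used above, just a commonly preferred stand-in because it only evaluates literals. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import ast

escaped = b'\\u54a9'    # six ASCII characters: backslash, u, 5, 4, a, 9

# 1. Decode the escape sequence with the unicode_escape codec.
via_codec = escaped.decode('unicode_escape')

# 2. Build a u"..." literal and evaluate it with ast.literal_eval
#    (substituted here for eval; double quotes are escaped as above).
literal = u'u"' + escaped.decode('ascii').replace(u'"', u'\\"') + u'"'
via_literal = ast.literal_eval(literal)

print('%r %r' % (via_codec, via_literal))
```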
I write about the solutions to problems I have run into in programming and data analytics. They may help you in your work. Thank you.