
Explore Unicode, encoding systems, string internals, and advanced text processing techniques.
Every character has a unique Unicode code point. Understanding encoding is critical for handling international text.
```python
# Unicode code points
s = "Hello"
for char in s:
    print(f"{char} → U+{ord(char):04X}")
# Output:
# H → U+0048
# e → U+0065
# ...

# Creating strings from code points
char = chr(0x0041)  # 'A'
print(char)

# Multi-byte characters
emoji = "😀"
print(len(emoji))             # 1 (one code point)
print(emoji.encode('utf-8'))  # b'\xf0\x9f\x98\x80' (4 bytes)
print(ord(emoji))             # 128512 (code point)
```

Key insight: one character doesn't always equal one byte. "😀" takes 4 bytes in UTF-8.
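How many bytes a character occupies depends on the encoding as well as the character; a quick comparison (the codec names are standard Python codec names):

```python
# Byte length of the same text in different encodings
text = "Café"
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(codec)
    print(f"{codec}: {len(data)} bytes")
# utf-8: 5 bytes      ('é' takes 2 bytes, each ASCII letter 1)
# utf-16-le: 8 bytes  (2 bytes per BMP code point)
# utf-32-le: 16 bytes (4 bytes per code point)
```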
```python
# Encode: string → bytes
text = "Café"
utf8 = text.encode('utf-8')      # b'Caf\xc3\xa9'
latin1 = text.encode('latin-1')  # b'Caf\xe9'

# Decode: bytes → string
decoded = utf8.decode('utf-8')  # "Café"
print(decoded)

# Error handling
bad_bytes = b'\xff\xfe'
print(bad_bytes.decode('utf-8', errors='ignore'))           # drops undecodable bytes
print(bad_bytes.decode('utf-8', errors='replace'))          # replaces them with U+FFFD (�)
print(bad_bytes.decode('utf-8', errors='backslashreplace')) # keeps them as \xff\xfe escapes
```
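Decoding with the *wrong* codec often succeeds silently and produces mojibake rather than raising an error; a small demonstration:

```python
# Mojibake: UTF-8 bytes decoded as Latin-1 "succeed" but look wrong
data = "Café".encode("utf-8")  # b'Caf\xc3\xa9'
wrong = data.decode("latin-1") # no exception raised
print(wrong)                   # "CafÃ©" (the two UTF-8 bytes of 'é' become two characters)
right = data.decode("utf-8")
print(right)                   # "Café"
```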
```python
# BOM (Byte Order Mark)
text_bom = "UTF-8 with BOM"
encoded = text_bom.encode('utf-8-sig')  # Includes BOM
print(encoded[:3])  # b'\xef\xbb\xbf' (the BOM bytes)
```

Some characters can be represented multiple ways. Normalization ensures consistency.
```python
import unicodedata

# Two different representations of "é"
s1 = "é"        # Single character U+00E9
s2 = "e\u0301"  # 'e' + combining acute accent
print(s1 == s2)          # False! (different representations)
print(len(s1), len(s2))  # 1 2

# Normalize to NFC (composed form)
s1_nfc = unicodedata.normalize('NFC', s1)
s2_nfc = unicodedata.normalize('NFC', s2)
print(s1_nfc == s2_nfc)  # True!

# Normalization forms:
# NFC:  Canonical Decomposition, followed by Canonical Composition
# NFD:  Canonical Decomposition
# NFKC: Compatibility Decomposition, followed by Canonical Composition
# NFKD: Compatibility Decomposition

text = "ℌello"  # ℌ is U+210C, a letterlike compatibility character
print(unicodedata.normalize('NFKC', text))  # "Hello" (compatibility normalization)
```

```python
# Grapheme clusters (multiple code points forming one visual character)
s = "👨‍👩‍👧‍👦"  # Family emoji: four emoji joined by zero-width joiners
print(len(s))               # 7 (each code point counted, including the 3 ZWJs)
print([ord(c) for c in s])  # List of code points
```
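A common pattern is to normalize before comparing user-supplied strings; a minimal helper (the function name `canonical_equal` is my own):

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonical_equal("é", "e\u0301"))  # True
print("é" == "e\u0301")                 # False
```

Add `str.casefold()` after normalization if the comparison should also ignore case.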
```python
# Slicing with extended slices
text = "Python"
print(text[::2])    # "Pto" (every 2nd character)
print(text[::-1])   # "nohtyP" (reverse)
print(text[1:5:2])  # "yh" (start at 1, stop before 5, step 2)

# Safe indexing with a get-like pattern
def safe_index(s, i, default=''):
    try:
        return s[i]
    except IndexError:
        return default
```
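Note that indexing and slicing operate on code points, not on user-perceived characters, so a slice can cut a combining sequence in half; a quick illustration:

```python
# Slicing a decomposed string can strip combining marks
s = "e\u0301clair"  # "éclair" with 'é' stored as 'e' + combining accent
print(s[:1])  # "e" (the combining accent at index 1 is cut off)
print(s[:2])  # "é" (both code points of the grapheme)
```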
```python
# String interning (CPython optimization)
s1 = "hello"
s2 = "hello"
print(s1 is s2)  # True (identical string literals are interned)

s3 = "".join(["hello"] * 100)
s4 = "".join(["hello"] * 100)
print(s3 is s4)  # False (strings built at runtime are not interned)
```

Interning is a CPython implementation detail: small literals and identifiers are interned, and the compiler may also constant-fold expressions like `"hello" * 100`, so never rely on `is` for string comparison; use `==`.
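When many equal strings are compared or used as keys repeatedly, `sys.intern` forces interning explicitly, making identity checks reliable for the interned copies:

```python
import sys

# Two runtime-built strings normally get distinct objects;
# sys.intern maps equal strings to a single shared object.
a = sys.intern("".join(["hello"] * 100))
b = sys.intern("".join(["hello"] * 100))
print(a is b)  # True (both names point at the single interned copy)
```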
```python
# String methods return new strings (immutability)
s = "Hello"
s.lower()  # Returns a new string; s is not modified
print(s)   # Still "Hello" (unchanged)

# Memory-efficient string building
# Bad: concatenation in a loop
text = ""
for i in range(1000):
    text += f"line {i}\n"  # May create many intermediate strings

# Good: build a list and join
lines = [f"line {i}\n" for i in range(1000)]
text = "".join(lines)  # Single concatenation
```

```python
import unicodedata

# Introspect characters
char = "α"  # Greek alpha
print(unicodedata.name(char))      # "GREEK SMALL LETTER ALPHA"
print(unicodedata.category(char))  # "Ll" (Letter, lowercase)

# Character classification
print("A".isupper())   # True
print("5".isdigit())   # True
print("α".isalpha())   # True
print("α5".isalnum())  # True
print(" ".isspace())   # True

# Character decomposition
print(unicodedata.decomposition("é"))  # "0065 0301" (e + combining acute)
```

| Concept | Remember |
|---|---|
| Unicode | Characters have code points; use `ord()` and `chr()` |
| Encoding | Always specify encoding when working with bytes |
| Normalization | Use NFC for consistent string comparison |
| Length | `len()` counts code points, not bytes or visual characters |
| Immutability | Strings never change; methods return new strings |
Explore String Formatting for advanced text composition techniques.