
Explore Unicode, encoding systems, string internals, and advanced text processing techniques.
Every character has a unique Unicode code point. Understanding encoding is critical for handling international text.
```python
# Unicode code points
s = "Hello"
for char in s:
    print(f"{char} → U+{ord(char):04X}")
# Output:
# H → U+0048
# e → U+0065
# ...

# Creating strings from code points
char = chr(0x0041)  # 'A'
print(char)

# Multi-byte characters
emoji = "😀"
print(len(emoji))             # 1 (one code point)
print(emoji.encode('utf-8'))  # b'\xf0\x9f\x98\x80' (4 bytes)
print(ord(emoji))             # 128512 (code point)
```

Key insight: one character doesn't always equal one byte. "😀" takes 4 bytes in UTF-8.
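How many bytes a character occupies depends on the encoding as well as the character; a quick comparison (the codec names are standard Python codec names):

```python
# Byte length of the same text in different encodings
text = "Café"
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(codec)
    print(f"{codec}: {len(data)} bytes")
# utf-8: 5 bytes      ('é' takes 2 bytes, each ASCII letter 1)
# utf-16-le: 8 bytes  (2 bytes per BMP code point)
# utf-32-le: 16 bytes (4 bytes per code point)
```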
```python
# Encode: string → bytes
text = "Café"
utf8 = text.encode('utf-8')      # b'Caf\xc3\xa9'
latin1 = text.encode('latin-1')  # b'Caf\xe9'

# Decode: bytes → string
decoded = utf8.decode('utf-8')  # "Café"
print(decoded)

# Error handling
bad_bytes = b'\xff\xfe'
print(bad_bytes.decode('utf-8', errors='ignore'))           # drops undecodable bytes
print(bad_bytes.decode('utf-8', errors='replace'))          # replaces them with U+FFFD (�)
print(bad_bytes.decode('utf-8', errors='backslashreplace')) # keeps them as \xff\xfe escapes
```
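Decoding with the *wrong* codec often succeeds silently and produces mojibake rather than raising an error; a small demonstration:

```python
# Mojibake: UTF-8 bytes decoded as Latin-1 "succeed" but look wrong
data = "Café".encode("utf-8")  # b'Caf\xc3\xa9'
wrong = data.decode("latin-1") # no exception raised
print(wrong)                   # "CafÃ©" (the two UTF-8 bytes of 'é' become two characters)
right = data.decode("utf-8")
print(right)                   # "Café"
```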
```python
# BOM (Byte Order Mark)
text_bom = "UTF-8 with BOM"
encoded = text_bom.encode('utf-8-sig')  # Includes BOM
print(encoded[:3])  # b'\xef\xbb\xbf' (the BOM bytes)
```

Some characters can be represented multiple ways. Normalization ensures consistency.
```python
import unicodedata

# Two different representations of "é"
s1 = "é"        # Single character U+00E9
s2 = "e\u0301"  # 'e' + combining acute accent
print(s1 == s2)          # False! (different representations)
print(len(s1), len(s2))  # 1 2

# Normalize to NFC (composed form)
s1_nfc = unicodedata.normalize('NFC', s1)
s2_nfc = unicodedata.normalize('NFC', s2)
print(s1_nfc == s2_nfc)  # True!

# Normalization forms:
# NFC:  Canonical Decomposition, followed by Canonical Composition
# NFD:  Canonical Decomposition
# NFKC: Compatibility Decomposition, followed by Canonical Composition
# NFKD: Compatibility Decomposition

text = "ℌello"  # ℌ is U+210C, a letterlike compatibility character
print(unicodedata.normalize('NFKC', text))  # "Hello" (compatibility normalization)
```

```python
# Grapheme clusters (multiple code points forming one visual character)
s = "👨‍👩‍👧‍👦"  # Family emoji: four emoji joined by zero-width joiners
print(len(s))               # 7 (each code point counted, including the 3 ZWJs)
print([ord(c) for c in s])  # List of code points
```
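A common pattern is to normalize before comparing user-supplied strings; a minimal helper (the function name `canonical_equal` is my own):

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonical_equal("é", "e\u0301"))  # True
print("é" == "e\u0301")                 # False
```

Add `str.casefold()` after normalization if the comparison should also ignore case.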
```python
# Slicing with extended slices
text = "Python"
print(text[::2])    # "Pto" (every 2nd character)
print(text[::-1])   # "nohtyP" (reverse)
print(text[1:5:2])  # "yh" (start at 1, stop before 5, step 2)

# Safe indexing with a get-like pattern
def safe_index(s, i, default=''):
    try:
        return s[i]
    except IndexError:
        return default
```
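Note that indexing and slicing operate on code points, not on user-perceived characters, so a slice can cut a combining sequence in half; a quick illustration:

```python
# Slicing a decomposed string can strip combining marks
s = "e\u0301clair"  # "éclair" with 'é' stored as 'e' + combining accent
print(s[:1])  # "e" (the combining accent at index 1 is cut off)
print(s[:2])  # "é" (both code points of the grapheme)
```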
```python
# String interning (CPython optimization)
s1 = "hello"
s2 = "hello"
print(s1 is s2)  # True (identical string literals are interned)

s3 = "".join(["hello"] * 100)
s4 = "".join(["hello"] * 100)
print(s3 is s4)  # False (strings built at runtime are not interned)
```

Interning is a CPython implementation detail: small literals and identifiers are interned, and the compiler may also constant-fold expressions like `"hello" * 100`, so never rely on `is` for string comparison; use `==`.
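When many equal strings are compared or used as keys repeatedly, `sys.intern` forces interning explicitly, making identity checks reliable for the interned copies:

```python
import sys

# Two runtime-built strings normally get distinct objects;
# sys.intern maps equal strings to a single shared object.
a = sys.intern("".join(["hello"] * 100))
b = sys.intern("".join(["hello"] * 100))
print(a is b)  # True (both names point at the single interned copy)
```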
```python
# String methods return new strings (immutability)
s = "Hello"
s.lower()  # Returns a new string; s is not modified
print(s)   # Still "Hello" (unchanged)

# Memory-efficient string building
# Bad: concatenation in a loop
text = ""
for i in range(1000):
    text += f"line {i}\n"  # May create many intermediate strings

# Good: build a list and join
lines = [f"line {i}\n" for i in range(1000)]
text = "".join(lines)  # Single concatenation
```

```python
import unicodedata

# Introspect characters
char = "α"  # Greek alpha
print(unicodedata.name(char))      # "GREEK SMALL LETTER ALPHA"
print(unicodedata.category(char))  # "Ll" (Letter, lowercase)

# Character classification
print("A".isupper())   # True
print("5".isdigit())   # True
print("α".isalpha())   # True
print("α5".isalnum())  # True
print(" ".isspace())   # True

# Character decomposition
print(unicodedata.decomposition("é"))  # "0065 0301" (e + combining acute)
```

| Concept | Remember |
|---|---|
| Unicode | Characters have code points; use `ord()` and `chr()` |
| Encoding | Always specify encoding when working with bytes |
| Normalization | Use NFC for consistent string comparison |
| Length | `len()` counts code points, not bytes or visual characters |
| Immutability | Strings never change; methods return new strings |
Explore String Formatting for advanced text composition techniques.