Ojasa Mirai

๐Ÿ“ Strings โ€” Advanced String Fundamentals

Explore Unicode, encoding systems, string internals, and advanced text processing techniques.


🎯 Unicode and Character Encoding

Every character has a unique Unicode code point. Understanding encoding is critical for handling international text.

# Unicode code points
s = "Hello"
for char in s:
    print(f"{char} → U+{ord(char):04X}")

# Output:
# H → U+0048
# e → U+0065
# ...

# Creating strings from code points
char = chr(0x0041)  # 'A'
print(char)

# Multi-byte characters
emoji = "😀"
print(len(emoji))           # 1 (one character)
print(emoji.encode('utf-8'))  # b'\xf0\x9f\x98\x80' (4 bytes)
print(ord(emoji))           # 128512 (code point)

Key insight: One character doesn't always equal one byte. "😀" takes 4 bytes in UTF-8.
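A quick sketch contrasting code-point length with encoded byte length (UTF-8 vs. UTF-16 little-endian; the sample strings are illustrative):

```python
# Code-point count vs. byte count depends on the character and the encoding.
samples = ["A", "é", "😀"]
for s in samples:
    utf8_bytes = len(s.encode("utf-8"))
    utf16_bytes = len(s.encode("utf-16-le"))
    print(f"{s!r}: {len(s)} code point(s), "
          f"{utf8_bytes} UTF-8 byte(s), {utf16_bytes} UTF-16 byte(s)")
```

ASCII characters stay 1 byte in UTF-8, while "😀" needs 4 bytes in both encodings.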


💡 String Encoding and Decoding

# Encode: string → bytes
text = "Café"
utf8 = text.encode('utf-8')      # b'Caf\xc3\xa9'
latin1 = text.encode('latin-1')  # b'Caf\xe9'

# Decode: bytes → string
decoded = utf8.decode('utf-8')   # "Café"
print(decoded)

# Error handling
bad_bytes = b'\xff\xfe'
print(bad_bytes.decode('utf-8', errors='ignore'))    # Skip undecodable bytes
print(bad_bytes.decode('utf-8', errors='replace'))   # Replace with U+FFFD (�)
print(bad_bytes.decode('utf-8', errors='backslashreplace'))  # \xff\xfe

# BOM (Byte Order Mark)
text_bom = "UTF-8 with BOM"
encoded = text_bom.encode('utf-8-sig')  # Includes BOM
print(encoded[:3])  # b'\xef\xbb\xbf' (BOM)
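The same rules apply to file I/O; a hedged sketch of a BOM round trip using a temporary file (the path is generated, not a fixed name):

```python
import os
import tempfile

# Write UTF-8 text with a BOM; reading with 'utf-8-sig' strips it transparently.
text = "Café"
with tempfile.NamedTemporaryFile("w", encoding="utf-8-sig",
                                 suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

with open(path, "rb") as f:
    raw = f.read()
print(raw[:3])  # b'\xef\xbb\xbf' (the BOM, visible in the raw bytes)

with open(path, encoding="utf-8-sig") as f:
    print(f.read() == text)  # True (BOM removed on read)

os.remove(path)
```

Always pass `encoding=` explicitly to `open()`; relying on the platform default is a common source of `UnicodeDecodeError`.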

🎨 String Normalization (Unicode Combining Characters)

Some characters can be represented multiple ways. Normalization ensures consistency.

import unicodedata

# Two different representations of "é"
s1 = "é"          # Single character U+00E9
s2 = "e\u0301"    # 'e' + combining acute accent

print(s1 == s2)   # False! (different representations)
print(len(s1), len(s2))  # 1, 2

# Normalize to NFC (composed form)
s1_nfc = unicodedata.normalize('NFC', s1)
s2_nfc = unicodedata.normalize('NFC', s2)
print(s1_nfc == s2_nfc)  # True!

# Normalization forms
# NFC: Canonical Decomposition, followed by Canonical Composition
# NFD: Canonical Decomposition
# NFKC: Compatibility Decomposition, followed by Canonical Composition
# NFKD: Compatibility Decomposition

text = "ℌello"  # ℌ is U+210C (BLACK-LETTER CAPITAL H)
print(unicodedata.normalize('NFKC', text))  # "Hello" (compatibility normalization)
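For robust user-facing comparisons, normalization is often combined with case folding; a sketch with an illustrative `equivalent` helper (not a stdlib function):

```python
import unicodedata

def equivalent(a: str, b: str) -> bool:
    """Compare strings after NFC normalization and case folding."""
    norm = lambda s: unicodedata.normalize("NFC", s).casefold()
    return norm(a) == norm(b)

print(equivalent("é", "e\u0301"))       # True (same character, two encodings)
print(equivalent("STRASSE", "straße"))  # True (ß case-folds to 'ss')
```

`casefold()` is a more aggressive `lower()` designed specifically for caseless matching.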

📊 Advanced String Indexing and Slicing

# Grapheme clusters (multiple code points forming one visual character)
s = "👨‍👩‍👧‍👦"  # Family emoji (4 person emoji joined by 3 zero-width joiners)
print(len(s))       # 7 (each code point counted, not 1 visual character)
print([ord(c) for c in s])  # List of code points

# Slicing with extended slices
text = "Python"
print(text[::2])    # "Pto" (every 2nd character)
print(text[::-1])   # "nohtyP" (reverse)
print(text[1:5:2])  # "yh" (start at 1, end at 5, step 2)

# Safe indexing with get-like pattern
def safe_index(s, i, default=''):
    try:
        return s[i]
    except IndexError:
        return default
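The family emoji above is a zero-width-joiner (ZWJ) sequence; a small sketch splitting it into its component emoji:

```python
# The family emoji is four person emoji joined by U+200D (ZERO WIDTH JOINER).
family = "👨\u200d👩\u200d👧\u200d👦"
parts = family.split("\u200d")
print(len(family))  # 7 code points (4 emoji + 3 joiners)
print(parts)        # ['👨', '👩', '👧', '👦']
```

Proper grapheme-cluster segmentation needs a third-party library (e.g. the `regex` module's `\X`); the stdlib only sees code points.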

🔑 String Internals and Optimization

# String interning (a CPython implementation detail — always compare with ==)
s1 = "hello"
s2 = "hello"
print(s1 is s2)  # True (identical literals in one module are shared)

s3 = "".join(["hello"] * 100)
s4 = "".join(["hello"] * 100)
print(s3 is s4)  # False (runtime-built strings are not interned)

# String methods return new strings (immutability)
s = "Hello"
lower = s.lower()  # Returns a new string
print(s)           # Still "Hello" (unchanged)
print(lower)       # "hello"

# Memory efficient string building
# Bad: concatenation in loop
text = ""
for i in range(1000):
    text += f"line {i}\n"  # May copy the growing string on each iteration

# Good: use list and join
lines = [f"line {i}\n" for i in range(1000)]
text = "".join(lines)  # Single concatenation
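Beyond literals, `sys.intern()` can intern runtime-built strings explicitly; a minimal sketch (the key name is illustrative):

```python
import sys

# sys.intern() makes repeated equal strings share one object, which saves
# memory and allows fast identity-based comparison (e.g. for dict keys).
key1 = sys.intern("".join(["user", "_", "id"]))
key2 = sys.intern("".join(["user", "_", "id"]))
print(key1 is key2)  # True — both names point at the same interned object
```

This is mainly useful when parsing large inputs that repeat the same tokens many times.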

🎨 Character Properties and Classification

import unicodedata

# Introspect characters
char = "α"  # Greek alpha
print(unicodedata.name(char))    # "GREEK SMALL LETTER ALPHA"
print(unicodedata.category(char))  # "Ll" (Letter, lowercase)

# Character classification
print("A".isupper())    # True
print("5".isdigit())    # True
print("α".isalpha())    # True
print("α5".isalnum())   # True
print(" ".isspace())    # True

# Character decomposition
print(unicodedata.decomposition("é"))  # "0065 0301" (e + combining acute)
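Decomposition plus category checks enable a common trick: stripping accents. A sketch with an illustrative helper (`strip_accents` is not a stdlib function):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining marks (Unicode category 'Mn') after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("Café"))   # Cafe
print(strip_accents("naïve"))  # naive
```

NFD splits "é" into 'e' + U+0301; filtering out category `Mn` (Mark, nonspacing) leaves only the base letters.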

🔑 Key Takeaways

| Concept | Remember |
|---|---|
| Unicode | Characters have code points; use `ord()` and `chr()` |
| Encoding | Always specify encoding when working with bytes |
| Normalization | Use NFC for consistent string comparison |
| Length | `len()` counts code points, not bytes or visual characters |
| Immutability | Strings never change; methods return new strings |

🔗 What's Next?

Explore String Formatting for advanced text composition techniques.


Ready to practice? Challenges | Quiz


Resources

Python Docs

© 2026 Ojasa Mirai. All rights reserved.
