
Python
Understand Python string implementation, memory optimization, and character encoding in depth.
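CPython caches some strings, mainly short identifier-like constants, so that equal strings can share one object (string interning). This is an implementation detail, so compare strings with `==` unless you intern them explicitly: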
# Interned strings
a = "hello"
b = "hello"
print(a is b) # True (same object in memory)
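# Constant expressions are folded at compile time, and the result may be interned
x = "a" * 10
y = "a" * 10
print(x is y) # True in CPython (folded constant, then interned)
x = "a" * 256
y = "a" * 256
print(x is y) # Usually True in recent CPython; version-dependent, so do not rely on it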
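
The identity checks above all involve compile-time constants. As a rough sketch of the other case, the snippet below builds two equal strings at runtime, where CPython generally does not intern automatically, and then interns them by hand. The variable names are illustrative only.

```python
import sys

# Both strings are built at runtime, so CPython creates two separate objects.
parts = ["hello", "world"]
a = "".join(parts)
b = "".join(parts)

print(a == b)  # True: the contents are equal
print(a is b)  # False in CPython: two distinct objects

# sys.intern returns the canonical copy, so identity now agrees with equality.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)  # True
```

This is why `==` is the safe default for comparing string contents, with `is` reserved for strings you have interned yourself.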
# Explicitly intern strings
import sys
s1 = sys.intern("hello" + "world")
s2 = sys.intern("helloworld")
print(s1 is s2) # True (same interned object)

Python 3 uses Unicode by default. Understanding encoding is crucial:
# Unicode code points
char = "é"
print(ord(char)) # 233 (code point)
print(chr(233)) # é (character from code point)
# Bytes encoding
text = "Hello 世界"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes) # b'Hello \xe4\xb8\x96\xe7\x95\x8c'
print(len(text)) # 8 (characters)
print(len(utf8_bytes)) # 12 (bytes)
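
The gap between 8 characters and 12 bytes comes from UTF-8 being a variable-width encoding. A minimal sketch, assuming the same sample text, that shows how many bytes each character contributes and what happens when a codec cannot represent a character:

```python
text = "Hello 世界"

# Number of UTF-8 bytes each character needs: ASCII takes 1, these CJK characters take 3.
widths = [len(ch.encode("utf-8")) for ch in text]
print(widths)       # [1, 1, 1, 1, 1, 1, 3, 3]
print(sum(widths))  # 12, the same as len(text.encode("utf-8"))

# Latin-1 has no mapping for CJK characters, so encoding raises an error.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)    # False
```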
# Decoding
decoded = utf8_bytes.decode("utf-8")
print(decoded) # Hello 世界
# Different encodings
latin1_bytes = "café".encode("latin-1")
print(latin1_bytes) # b'caf\xe9'

Different concatenation methods have different performance characteristics:
import timeit
# Inefficient: string concatenation in loop
def concat_naive(n):
    result = ""
    for i in range(n):
        result += str(i)
    return result

# Efficient: list join
def concat_list(n):
    return "".join(str(i) for i in range(n))

# F-strings to format each piece, then join
def concat_fstring(n):
    parts = [f"{i}" for i in range(n)]
    return "".join(parts)

# Timing comparison
n = 10000
print(timeit.timeit(lambda: concat_naive(n), number=1)) # Can be much slower, though CPython sometimes optimizes += in place
print(timeit.timeit(lambda: concat_list(n), number=1)) # Fast

CPython stores each string in one of several internal representations, chosen by its widest character (PEP 393):
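# ASCII strings use 1 byte per character
ascii_string = "hello"
print(ascii_string.__sizeof__()) # e.g. 54 bytes on 64-bit CPython 3.11; varies by version
# Non-ASCII Latin-1 strings also use 1 byte per character, with a larger fixed header
latin1_string = "café"
print(latin1_string.__sizeof__()) # Larger than the ASCII case despite being shorter
# Strings with other BMP characters use 2 bytes per character (UCS-2)
mixed_string = "hello 世"
print(mixed_string.__sizeof__()) # Larger still: all 7 characters now take 2 bytes each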
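
Absolute sizes vary between CPython versions, but the per-character width can be estimated by subtracting the sizes of two strings of different lengths, which cancels out the fixed header. The helper name `bytes_per_char` is hypothetical, and the sketch assumes CPython 3.3 or later, where PEP 393 applies.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string sizes."""
    short = sys.getsizeof(ch)            # header + 1 character
    long = sys.getsizeof(ch * (n + 1))   # header + (n + 1) characters
    return (long - short) / n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "é"),
    ("BMP", "世"),
    ("astral", "\U0001F600"),  # an emoji outside the Basic Multilingual Plane
]
for label, ch in samples:
    print(f"{label:8} {bytes_per_char(ch):.0f} byte(s) per character")
```

On CPython this is expected to report widths of 1, 1, 2, and 4 bytes, matching the three storage kinds PEP 393 defines.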
# Compare total object sizes
import sys
print(sys.getsizeof("a")) # Fixed header plus one character
print(sys.getsizeof("a" * 100)) # Grows by 1 byte per extra ASCII character

# Raw strings avoid escape processing
normal = "line1\nline2\nline3"
raw = r"line1\nline2\nline3"
# For regex, raw strings are essential
import re
# Bad: double escaping
pattern1 = "\\d{3}-\\d{4}"
# Good: raw string
pattern2 = r"\d{3}-\d{4}"
# Verify they're identical
assert pattern1 == pattern2

Slicing behavior and performance considerations:
# Slicing returns new string object
original = "Hello World"
slice1 = original[0:5]
slice2 = original[0:5]
print(slice1 is slice2) # False (different objects)
# Large string slicing
large = "x" * 1000000
small_slice = large[0:100]
# The slice is an independent 100-character copy; it does not keep `large` alive
# Stride slicing performance
text = "0123456789" * 100
every_other = text[::2] # Creates new string
# This is O(n) in the length of the result, even though it looks simple

import re
# Named groups for clarity
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
text = "2026-02-20"
match = re.match(pattern, text)
if match:
    groups = match.groupdict()
    print(groups) # {'year': '2026', 'month': '02', 'day': '20'}
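
Named groups pair naturally with keyword arguments. The following is a minimal sketch, not part of the original example, that turns the captured fields into a `datetime.date`. The helper `parse_iso_date` is a hypothetical name, and it uses `re.fullmatch` on the assumption that trailing text should be rejected (`re.match` only anchors at the start).

```python
import re
from datetime import date

DATE_RE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def parse_iso_date(text):
    """Return a date for 'YYYY-MM-DD' input, or None if it does not fit."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        return None
    # Group names match date()'s keyword arguments: year, month, day.
    fields = {name: int(value) for name, value in match.groupdict().items()}
    try:
        return date(**fields)
    except ValueError:
        return None  # e.g. month 13 passes the regex but is not a real date

print(parse_iso_date("2026-02-20"))       # 2026-02-20
print(parse_iso_date("2026-02-20 tail"))  # None
print(parse_iso_date("2026-13-01"))       # None
```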
# Verbose regex for documentation
verbose_pattern = r"""
(?P<year>\d{4}) # Year
- # Separator
(?P<month>\d{2}) # Month
- # Separator
(?P<day>\d{2}) # Day
"""
match = re.match(verbose_pattern, text, re.VERBOSE)
if match:
    print(match.groupdict())

import timeit
# When building many strings, consider efficiency
def format_f_string(n):
    return [f"item_{i}" for i in range(n)]

def format_format(n):
    return ["item_{}".format(i) for i in range(n)]

def format_percent(n):
    return ["item_%d" % i for i in range(n)]

# F-strings are typically fastest, but measure on your own interpreter
n = 10000
print(timeit.timeit(lambda: format_f_string(n), number=100))
print(timeit.timeit(lambda: format_format(n), number=100))
print(timeit.timeit(lambda: format_percent(n), number=100))

| Concept | Remember |
|---|---|
| String interning | CPython interns some strings automatically; call `sys.intern()` before relying on `is` |
| Unicode complexity | Characters ≠ bytes; always specify encoding |
| Concatenation | Use `join()` (O(n)) instead of `+` in a loop (worst case O(n²)) |
| Slicing efficiency | Slicing copies into a new string; pass indices instead when you can |
| Encoding matters | Default UTF-8 handles most cases; specify when needed |
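
As one illustration of the slicing row, several `str` methods accept optional start and end positions, so a region can be examined without copying it first. A small sketch with an arbitrary sample string:

```python
text = "Hello World"

# Slicing first copies the tail into a new string, then tests it.
via_slice = text[6:].startswith("World")

# Passing a start index tests the same region without making a copy.
via_index = text.startswith("World", 6)

# str.find and str.count accept start/end bounds in the same way.
position = text.find("o", 5)         # search from index 5 onward
occurrences = text.count("l", 0, 5)  # count within the first five characters

print(via_slice, via_index, position, occurrences)  # True True 7 2
```

For short strings the difference is negligible; it matters mainly when the same large string is probed many times.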