Lesson 6
Common Base64 Mistakes
Wrong alphabet, line breaks, decoding UTF-8 text incorrectly.
Base64 pitfalls cluster around four themes: alphabet mismatches, whitespace surprises, double encoding, and mishandling UTF-8 text vs raw bytes.
Alphabet soup
Symptoms:
- intermittent decode failures when
-/_(URL-safe) and+//(standard) are mixed - pasted JWT-looking fragments that still contain
+or/where the spec expected Base64URL
Remediation: centralize conversions; write unit tests asserting round-trip fidelity for canonical samples (padded MIME, unpadded URL-safe vectors).
Avoid regex “fixups” scattered through services—eventually one path forgets trimming or normalizes wrongly.
Case sensitivity quirks
Alphabet casing matters for A–Z vs a–z segment indexes; accidental .toLowerCase() on entire payload destroys information.
Lingering whitespace and newlines
Copy from PDFs/email often injects stray spaces & soft breaks.
Strategies:
- decode via library flag
ignoreWhitespaceonly when spec allows - PEM parsing strips header/footer lines—not arbitrary interior spaces reliably
- log pipeline might split records—reject suspiciously short fragments
Treat invisible Unicode spaces (narrow no-break space, BOM) like bugs: normalize input early with explicit allowance list.
URL decoding order bugs
Applying percent-decoding twice or swapping order against Base64 normalization yields garbage bytes—especially %2B, %2F, %3D (+,/,= equivalents).
Exercise path: user-submitted tokens through multiple proxies—capture raw query before frameworks mutate spaces.
Double encoding accidental ladder
Someone Base64 encodes credentials JSON, receives string, mistakenly Base64 again, stores in DB—the stored artifact no longer matches any party’s assumptions.
Establish naming in code (encodeOnce, clarify layer responsibilities) and assertion logs print length + first few alphabet chars hashed for forensic diff without leaking secrets.
UTF-8 text confusion
“You Base64-encoded my Chinese string wrongly.” Frequently the failure is a layer mix-up: someone encoded UTF-8 bytes of the string but decoded into the wrong charset, or compared against a different normalization of the same logical text.
Use your language’s helpers that make the pipeline explicit: Unicode string → UTF-8 bytes → Base64 text, and the reverse when decoding.
Never assume a legacy single-byte default—platform code pages have historically produced mojibake after decode.
JSON payloads
JSON requires UTF-8 (or documented escape-heavy alternatives). Encode serialized JSON bytes, not textual pretty-print divergence with BOM differences versus server normalization.
Database text vs binary columns
Saving Base64-decoded protobuf into TEXT typed column with collation transforms can subtly corrupt—inappropriate but observed when ORM coercion mislabels types.
Security footnotes
Base64 blobs are opaque to casual eyes, not confidential. Shipping secrets only Base64-encoded is equivalent to plaintext on untrusted transports.
Signatures & MACs operate on canonical byte layouts—changing padding form or casing invalidates verification silently.
Constant-time pitfalls
Rare but real: naive string equality comparisons on Base64-derived secrets vs timing attacks—prefer crypto libraries’ secure compare primitives when handling raw tokens—not Base64 invention problem but adjacent.
Key takeaway
Interoperability wins when teams agree on:
- Alphabet & padding rules
- Pre-decoding normalization (newline/URL steps)
- Whether content is textual (UTF-8) or opaque bytes
- No recursive encoding without documented protocol layering
Establish golden vectors in CI to catch regressions early.