Lesson 6

Common Base64 Mistakes

Wrong alphabet, line breaks, decoding UTF-8 text incorrectly.

Base64 pitfalls cluster around four themes: alphabet mismatches, whitespace surprises, double encoding, and mishandling UTF-8 text vs raw bytes.

Alphabet soup

Symptoms:

intermittent decode failures when -/_ (URL-safe) and +// (standard) are mixed
pasted JWT-looking fragments that still contain + or / where the spec expected Base64URL

Remediation: centralize conversions; write unit tests asserting round-trip fidelity for canonical samples (padded MIME, unpadded URL-safe vectors).

Avoid regex “fixups” scattered through services—eventually one path forgets trimming or normalizes wrongly.

Case sensitivity quirks

Alphabet casing matters for A–Z vs a–z segment indexes; accidental .toLowerCase() on entire payload destroys information.

Lingering whitespace and newlines

Copy from PDFs/email often injects stray spaces & soft breaks.

Strategies:

decode via library flag ignoreWhitespace only when spec allows
PEM parsing strips header/footer lines—not arbitrary interior spaces reliably
log pipeline might split records—reject suspiciously short fragments

Treat invisible Unicode spaces (narrow no-break space, BOM) like bugs: normalize input early with explicit allowance list.

URL decoding order bugs

Applying percent-decoding twice or swapping order against Base64 normalization yields garbage bytes—especially %2B, %2F, %3D (+,/,= equivalents).

Exercise path: user-submitted tokens through multiple proxies—capture raw query before frameworks mutate spaces.

Double encoding accidental ladder

Someone Base64 encodes credentials JSON, receives string, mistakenly Base64 again, stores in DB—the stored artifact no longer matches any party’s assumptions.

Establish naming in code (encodeOnce, clarify layer responsibilities) and assertion logs print length + first few alphabet chars hashed for forensic diff without leaking secrets.

UTF-8 text confusion

“You Base64-encoded my Chinese string wrongly.” Frequently the failure is a layer mix-up: someone encoded UTF-8 bytes of the string but decoded into the wrong charset, or compared against a different normalization of the same logical text.

Use your language’s helpers that make the pipeline explicit: Unicode string → UTF-8 bytes → Base64 text, and the reverse when decoding.

Never assume a legacy single-byte default—platform code pages have historically produced mojibake after decode.

JSON payloads

JSON requires UTF-8 (or documented escape-heavy alternatives). Encode serialized JSON bytes, not textual pretty-print divergence with BOM differences versus server normalization.

Database text vs binary columns

Saving Base64-decoded protobuf into TEXT typed column with collation transforms can subtly corrupt—inappropriate but observed when ORM coercion mislabels types.

Security footnotes

Base64 blobs are opaque to casual eyes, not confidential. Shipping secrets only Base64-encoded is equivalent to plaintext on untrusted transports.

Signatures & MACs operate on canonical byte layouts—changing padding form or casing invalidates verification silently.

Constant-time pitfalls

Rare but real: naive string equality comparisons on Base64-derived secrets vs timing attacks—prefer crypto libraries’ secure compare primitives when handling raw tokens—not Base64 invention problem but adjacent.

Key takeaway

Interoperability wins when teams agree on:

Alphabet & padding rules
Pre-decoding normalization (newline/URL steps)
Whether content is textual (UTF-8) or opaque bytes
No recursive encoding without documented protocol layering

Establish golden vectors in CI to catch regressions early.

← Back to course overview