base64

Base64 is a method for representing arbitrary binary data using 64 characters.

When you open files like exe, jpg, or pdf with a text editor (e.g., Notepad), you will see a jumble of garbled text. This is because binary files contain many bytes that do not map to displayable or printable characters. Therefore, to let text-processing software like Notepad handle binary data, you need a way to convert binary data to a string. Base64 is the most common binary-to-text encoding for this purpose.

The principle of Base64 is straightforward: first, prepare an array containing 64 characters:

['A', 'B', 'C', ... 'a', 'b', 'c', ... '0', '1', ... '+', '/']

Next, process the binary data in groups of 3 bytes (totaling 3×8=24 bits). Split these 24 bits into 4 groups, with exactly 6 bits per group:

┌───────────────┬───────────────┬───────────────┐
│      b1       │      b2       │      b3       │
├─┬─┬─┬─┬─┬─┬─┬─┼─┬─┬─┬─┬─┬─┬─┬─┼─┬─┬─┬─┬─┬─┬─┬─┤
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
├─┴─┴─┴─┴─┴─┼─┴─┴─┴─┴─┴─┼─┴─┴─┴─┴─┴─┼─┴─┴─┴─┴─┴─┤
│    n1     │    n2     │    n3     │    n4     │
└───────────┴───────────┴───────────┴───────────┘

You now have 4 numbers, each in the range 0 to 63. Use them as indices into the array to get the 4 corresponding characters, which form the encoded string.
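
The grouping described above can be sketched by hand in a few lines of Python. This is only an illustration: the names ALPHABET and encode_group are made up for this example, and it handles exactly one complete 3-byte group, with no padding logic.

```python
import base64
import string

# The standard Base64 alphabet: A-Z, a-z, 0-9, then '+' and '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_group(b1, b2, b3):
    """Encode exactly 3 byte values (0-255 each) into 4 Base64 characters."""
    # Join the 3 bytes into one 24-bit integer.
    bits = (b1 << 16) | (b2 << 8) | b3
    # Cut the 24 bits into four 6-bit indices, most significant first.
    n1 = (bits >> 18) & 0x3F
    n2 = (bits >> 12) & 0x3F
    n3 = (bits >> 6) & 0x3F
    n4 = bits & 0x3F
    return ALPHABET[n1] + ALPHABET[n2] + ALPHABET[n3] + ALPHABET[n4]

group = encode_group(*b"bin")
print(group)  # Ymlu
# Cross-check against the standard library.
print(group == base64.b64encode(b"bin").decode("ascii"))  # True
```

In real code you would call the base64 module instead. The hand-written version only makes the bit arithmetic visible.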

Thus, Base64 encoding converts every 3 bytes of binary data into 4 bytes of text data, increasing the length by about 33% (one third). The advantage is that the encoded text data can be displayed directly in email bodies, web pages, etc.
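
As a quick check of the 3-to-4 ratio, the following sketch compares the length of the encoded output with a simple formula. The helper name encoded_length is an assumption made for this example, and the formula counts the padded output that base64.b64encode produces.

```python
import base64
import os

def encoded_length(n):
    """Padded Base64 length for n input bytes: 4 characters per 3-byte group, rounded up."""
    return 4 * ((n + 2) // 3)

for n in (1, 2, 3, 30, 3000):
    # Random bytes stand in for arbitrary binary data.
    actual = len(base64.b64encode(os.urandom(n)))
    print(n, "bytes ->", actual, "characters; formula gives", encoded_length(n))
```

For 3000 bytes of input, both values are 4000, which is the one-third increase mentioned above.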

What if the length of the binary data is not a multiple of 3 bytes, leaving 1 or 2 bytes at the end? Base64 pads the end of the data with \x00 bytes before encoding, then ends the encoded string with 1 or 2 = characters to indicate how many bytes were padded: 2 leftover bytes give one =, and 1 leftover byte gives two. The decoder uses these characters to drop the padded bytes automatically.
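
The effect of the padding rule is easy to observe with the standard library. The sketch below encodes inputs of 3, 2, and 1 bytes; the variable names are chosen only for this example.

```python
import base64

samples = [b"abc", b"ab", b"a"]
encoded = [base64.b64encode(data) for data in samples]

for data, enc in zip(samples, encoded):
    # Decoding restores the original bytes; the padding never leaks into the result.
    print(data, "->", enc, "->", base64.b64decode(enc))

# b'abc' -> b'YWJj' -> b'abc'   (no padding needed)
# b'ab'  -> b'YWI=' -> b'ab'    (1 byte padded, one '=')
# b'a'   -> b'YQ==' -> b'a'     (2 bytes padded, two '=')
```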

Python’s built-in base64 module enables direct Base64 encoding and decoding:

>>> import base64
>>> base64.b64encode(b'binary\x00string')
b'YmluYXJ5AHN0cmluZw=='
>>> base64.b64decode(b'YmluYXJ5AHN0cmluZw==')
b'binary\x00string'

Standard Base64 encoding may produce the + and / characters, which have special meanings in URLs and therefore cannot be used directly in URL parameters. For this reason, a “url-safe” Base64 encoding exists. It simply replaces + with - and / with _:

>>> base64.b64encode(b'i\xb7\x1d\xfb\xef\xff')
b'abcd++//'
>>> base64.urlsafe_b64encode(b'i\xb7\x1d\xfb\xef\xff')
b'abcd--__'
>>> base64.urlsafe_b64decode('abcd--__')
b'i\xb7\x1d\xfb\xef\xff'

You can also customize the order of the 64 characters to create a custom Base64 encoding—however, this is almost never necessary in practice.
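
For illustration only, one way to build such a custom encoding is to translate the output of the standard encoder into a reordered table. The CUSTOM table and the function names below are hypothetical and are not part of the base64 module.

```python
import base64
import string

STANDARD = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/").encode("ascii")
# Hypothetical custom table: the same 64 characters, lowercase letters first.
CUSTOM = (string.ascii_lowercase + string.ascii_uppercase
          + string.digits + "+/").encode("ascii")

# Byte-for-byte translation tables between the two orderings.
TO_CUSTOM = bytes.maketrans(STANDARD, CUSTOM)
TO_STANDARD = bytes.maketrans(CUSTOM, STANDARD)

def custom_b64encode(data):
    # '=' is not in either table, so padding passes through unchanged.
    return base64.b64encode(data).translate(TO_CUSTOM)

def custom_b64decode(encoded):
    return base64.b64decode(encoded.translate(TO_STANDARD))

encoded = custom_b64encode(b"binary\x00string")
print(encoded)
print(custom_b64decode(encoded))
```

Anyone holding the same table can reverse the translation, so this changes only how the output looks.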

Base64 is a table-based encoding method and cannot be used for encryption, even with a custom encoding table, because anyone who has the table can decode the data.

Base64 is suitable for encoding small pieces of content, such as digital certificate signatures, Cookie contents, etc.

The = character may appear in Base64-encoded strings, but it also separates keys from values in URLs and cookies, so it can cause ambiguity there. For this reason, many implementations remove the = padding after Base64 encoding:

# Standard Base64:
'abcd' -> 'YWJjZA=='
# With the = padding removed:
'abcd' -> 'YWJjZA'

How do you decode a string whose = padding has been removed? Since Base64 converts every 3 bytes into 4 characters, the length of a standard Base64-encoded string is always a multiple of 4. To decode correctly, append = characters until the string length is a multiple of 4.
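
The number of = characters to append follows directly from the string length. The helper below is a sketch with a made-up name, missing_padding. It only counts the missing characters and leaves the decoding itself to the exercise.

```python
def missing_padding(encoded_length):
    """Number of '=' characters needed to reach the next multiple of 4."""
    return -encoded_length % 4

for s in ("YWJjZA", "YWJjZA==", "YWI", "YWJj"):
    print(s, "needs", missing_padding(len(s)), "more '=' character(s)")
```

A valid unpadded string never has a length that leaves remainder 1 when divided by 4, because a single 6-bit character cannot hold a whole byte. A result of 3 from this helper therefore signals malformed input.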

Summary

Base64 is an encoding method that converts arbitrary binary data to text strings. It is commonly used to transmit small amounts of binary data in URLs, Cookies, and web pages.

Exercise

Write a function to decode Base64 strings with the = padding removed:

import base64

def safe_base64_decode(s):
    pass

# Test cases:
assert b'abcd' == safe_base64_decode('YWJjZA=='), safe_base64_decode('YWJjZA==')
assert b'abcd' == safe_base64_decode('YWJjZA'), safe_base64_decode('YWJjZA')
print('ok')

Reference Source Code

do_base64.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import base64

s = base64.b64encode("在Python中使用BASE 64编码".encode("utf-8"))
print(s)
d = base64.b64decode(s).decode("utf-8")
print(d)

s = base64.urlsafe_b64encode("在Python中使用BASE 64编码".encode("utf-8"))
print(s)
d = base64.urlsafe_b64decode(s).decode("utf-8")
print(d)
