hashlib

Introduction to Hash Algorithms

Python’s hashlib module provides common hash algorithms such as MD5 and SHA1.

What is a hash algorithm? Also known as a digest algorithm, it uses a function to convert data of arbitrary length into a fixed-length value, usually written as a hexadecimal string.

For example, suppose you write an article with the content string 'how to use python hashlib - by Michael', and attach its hash value: '2d73d4f15c0db7f5ecb321b6a65e5d6d'. If someone tampers with your article and publishes it as 'how to use python hashlib - by Bob', you can immediately point out the tampering—because the hash calculated from the altered string will differ from the original.

In other words, a hash algorithm applies a hash function hash(data) to data of arbitrary length and produces a fixed-length digest, with the goal of detecting whether the original data has been tampered with.

Hash algorithms can detect tampering because hash functions are one-way functions: calculating digest = hash(data) is easy, but reversing the process (deriving data from digest) is extremely difficult. Furthermore, modifying even a single bit of the original data will result in a completely different hash value.
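
The tampering check described above can be sketched in a few lines. This is only an illustration: the helper name md5_hex is an assumption introduced here, not something hashlib provides, and the two strings are the article examples from earlier in this section.

```python
import hashlib

def md5_hex(text):
    # Hex MD5 digest of a string, encoded as UTF-8 (illustrative helper)
    return hashlib.md5(text.encode('utf-8')).hexdigest()

original = 'how to use python hashlib - by Michael'
tampered = 'how to use python hashlib - by Bob'

original_digest = md5_hex(original)
tampered_digest = md5_hex(tampered)

print(original_digest)
print(tampered_digest)
# Same length, different content: the mismatch reveals the change
print(original_digest == tampered_digest)
```

Both digests are 32 hexadecimal characters long, yet they do not match, so a reader who holds the original digest can tell that the text was altered.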

Let’s use the common MD5 hash algorithm to calculate the MD5 value of a string:

import hashlib

md5 = hashlib.md5()
md5.update('how to use md5 in python hashlib?'.encode('utf-8'))
print(md5.hexdigest())

The output is:

d26a53750bc40b38b65a520292f69306

For large amounts of data, you can call update() several times with successive chunks; the final result is the same:

import hashlib

md5 = hashlib.md5()
md5.update('how to use md5 in '.encode('utf-8'))
md5.update('python hashlib?'.encode('utf-8'))
print(md5.hexdigest())

Try changing a single letter and observe how the result changes completely.
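
Chunked updates matter most when the data comes from a file or stream that should not be loaded into memory at once. The sketch below assumes a hypothetical helper md5_of_stream and an arbitrary chunk size; io.BytesIO stands in for a real file opened in binary mode.

```python
import hashlib
import io

def md5_of_stream(stream, chunk_size=8192):
    # Feed the stream to MD5 one chunk at a time (chunk_size is an arbitrary choice)
    md5 = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
    return md5.hexdigest()

data = b'how to use md5 in python hashlib?' * 1000

# Hash the same bytes in 64-byte chunks and in a single call
chunked = md5_of_stream(io.BytesIO(data), chunk_size=64)
one_shot = hashlib.md5(data).hexdigest()

print(chunked == one_shot)  # True
```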

MD5 is one of the best-known hash algorithms: it is fast and produces a fixed 128-bit (16-byte) result, usually written as a 32-character hexadecimal string.

Another common hash algorithm is SHA1, which is invoked almost identically to MD5:

import hashlib

sha1 = hashlib.sha1()
sha1.update('how to use sha1 in '.encode('utf-8'))
sha1.update('python hashlib?'.encode('utf-8'))
print(sha1.hexdigest())

SHA1 produces a 160-bit/20-byte result (usually a 40-character hexadecimal string).

More secure alternatives to SHA1 are SHA256 and SHA512. The trade-off is that they produce longer hash values and typically take somewhat more time to compute.
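
The digest lengths mentioned so far can be compared side by side. This short sketch uses hashlib.new() to select each algorithm by name; the input string is just an example.

```python
import hashlib

data = 'how to use sha256 in python hashlib?'.encode('utf-8')

sizes = {}
for name in ('md5', 'sha1', 'sha256', 'sha512'):
    h = hashlib.new(name, data)
    # digest_size is in bytes; the hex string is twice as long
    sizes[name] = (h.digest_size, len(h.hexdigest()))
    print(name, h.digest_size, len(h.hexdigest()))
```

SHA256 yields 32 bytes (64 hexadecimal characters) and SHA512 yields 64 bytes (128 hexadecimal characters), compared with 16 bytes for MD5 and 20 bytes for SHA1.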

Is it possible for two different sets of data to produce the same hash via a given algorithm? Absolutely. All hash algorithms map an infinite set of data to a finite set of hashes. This scenario is called a collision. For example, Bob could attempt to create an article 'how to learn hashlib in python - by Bob' whose hash matches yours exactly. While not impossible, this is extremely difficult to achieve.
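
To see why collisions must exist, it helps to shrink the output space. The sketch below keeps only the first byte of each MD5 digest, so there are just 256 possible values and any 257 distinct inputs must produce a repeat. The truncation and the 'article-N' input strings are assumptions made purely for illustration; finding a collision on a full digest is far harder.

```python
import hashlib

def first_byte(text):
    # First byte of the MD5 digest, as two hex characters
    return hashlib.md5(text.encode('utf-8')).hexdigest()[:2]

def find_truncated_collision():
    seen = {}  # truncated hash -> first input that produced it
    for i in range(257):  # 257 inputs, only 256 possible values
        text = 'article-%d' % i
        prefix = first_byte(text)
        if prefix in seen:
            return seen[prefix], text, prefix
        seen[prefix] = text
    # Not reachable: with 257 inputs a repeat is guaranteed

a, b, prefix = find_truncated_collision()
print(a, b, prefix)
```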

Applications of Hash Algorithms

Where are hash algorithms used? A common example:

Any website that allows user login must store usernames and passwords. How should these credentials be stored? One approach is to save them in a database table:

username	password
michael	123456
bob	abc999
alice	alice2008

Storing passwords in plaintext is risky: if the database is compromised, hackers gain access to all user passwords. Additionally, website administrators can access the database and view all plaintext passwords.

A better approach is to store a hash of each password (MD5 in this example) instead of the plaintext:

username	password
michael	e10adc3949ba59abbe56e057f20f883e
bob	878ef96e86145580c38c87f0410ad153
alice	99b1c2188db85afee403b1536010c2c9

When a user logs in:

Calculate the MD5 hash of the password they enter;
Compare it to the MD5 hash stored in the database;
If they match, the password is correct; otherwise, it is invalid.

Exercise 1

Write a function to calculate the MD5 hash of a user-input password (for database storage):

def calc_md5(password):
    pass

Storing MD5 hashes ensures that even administrators with database access cannot view plaintext passwords.

Design a function to validate user logins (return True/False based on password correctness):

db = {
    'michael': 'e10adc3949ba59abbe56e057f20f883e',
    'bob': '878ef96e86145580c38c87f0410ad153',
    'alice': '99b1c2188db85afee403b1536010c2c9'
}

def login(user, password):
    pass

# Test cases:
assert login('michael', '123456')
assert login('bob', 'abc999')
assert login('alice', 'alice2008')
assert not login('michael', '1234567')
assert not login('bob', '123456')
assert not login('alice', 'Alice2008')
print('ok')
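
One possible solution to Exercise 1 is sketched below; try the exercise yourself before reading it. The db values are copied from the exercise. Returning False for an unknown username is an assumption added here, since the exercise does not say how that case should be handled.

```python
import hashlib

def calc_md5(password):
    # Encode the password to bytes, hash it, and return the hex digest
    return hashlib.md5(password.encode('utf-8')).hexdigest()

db = {
    'michael': 'e10adc3949ba59abbe56e057f20f883e',
    'bob': '878ef96e86145580c38c87f0410ad153',
    'alice': '99b1c2188db85afee403b1536010c2c9'
}

def login(user, password):
    stored = db.get(user)
    if stored is None:
        # Unknown user (assumed behavior): treat as a failed login
        return False
    return stored == calc_md5(password)

print(login('michael', '123456'))   # True
print(login('michael', '1234567'))  # False
```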

Is storing MD5 hashes completely secure? Not necessarily. Suppose you are a hacker who has obtained a database of MD5 hashed passwords—how would you reverse-engineer the plaintext passwords? Brute-force cracking is time-consuming, and real hackers avoid this approach.

Consider this scenario: many users choose simple passwords like 123456, 888888, or password. Hackers can precompute MD5 hashes for these common passwords to create a reverse lookup table:

hash_to_plain = {
    'e10adc3949ba59abbe56e057f20f883e': '123456',
    '21218cca77804d2ba1922c33e0151105': '888888',
    '5f4dcc3b5aa765d61d8327deb882cf99': 'password',
    '...': '...'
}

With this table, hackers can map MD5 hashes in the database straight back to plaintext passwords without any cracking, so accounts with weak passwords are compromised instantly.
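
The lookup can be sketched as follows. The list of common passwords and the leaked dictionary are small illustrative samples (the two hashes are taken from the table earlier in this section), and md5_hex is a helper name introduced here.

```python
import hashlib

def md5_hex(text):
    # Hex MD5 digest of a string (illustrative helper)
    return hashlib.md5(text.encode('utf-8')).hexdigest()

# Precompute hash -> plaintext for a few common passwords
common_passwords = ['123456', '888888', 'password']
hash_to_plain = {md5_hex(p): p for p in common_passwords}

# Hashes as they might appear in a leaked table
leaked = {
    'michael': 'e10adc3949ba59abbe56e057f20f883e',
    'bob': '878ef96e86145580c38c87f0410ad153',
}

for user, digest in leaked.items():
    # A dictionary lookup is all it takes; no hash is ever reversed
    print(user, hash_to_plain.get(digest, '(not in table)'))
```

Here michael's password is recovered immediately, while bob's hash is not in this small table.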

Users should avoid simple passwords, but can we strengthen security at the program level?

Since MD5 hashes of common passwords are easy to precompute, we need to ensure stored hashes do not match these precomputed values. This is achieved by appending a complex string to the original password—commonly known as salting:

def calc_md5(password):
    return get_md5(password + 'the-Salt')

With salted MD5 hashes, even simple passwords are hard to reverse-engineer—as long as the salt remains secret.
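
The snippet above calls get_md5, which this section never defines. A minimal runnable sketch, assuming get_md5 simply returns the hexadecimal MD5 of a string, shows that the salted hash of a common password no longer matches the value an attacker would have precomputed:

```python
import hashlib

def get_md5(s):
    # Assumed helper: hex MD5 of a string encoded as UTF-8
    return hashlib.md5(s.encode('utf-8')).hexdigest()

def calc_md5(password):
    # Append the fixed salt before hashing
    return get_md5(password + 'the-Salt')

plain_hash = get_md5('123456')
salted_hash = calc_md5('123456')

print(plain_hash)
print(salted_hash)
print(plain_hash == salted_hash)  # False
```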

However, if two users use the same simple password (e.g., 123456), their salted MD5 hashes will still be identical (revealing shared passwords). Can we store unique MD5 hashes for users with identical passwords?

Assuming usernames are immutable, we can include the username as part of the salt when calculating MD5—ensuring unique hashes even for identical passwords.

Exercise 2

Simulate user registration by calculating a more secure MD5 hash from the username and password:

db = {}

def register(username, password):
    db[username] = get_md5(password + username + 'the-Salt')
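
To run register as written, get_md5 needs a definition. The sketch below assumes the same one-argument helper as before (hex MD5 of a string) and checks that two users with the same password end up with different stored hashes:

```python
import hashlib

db = {}

def get_md5(s):
    # Assumed helper: hex MD5 of a string encoded as UTF-8
    return hashlib.md5(s.encode('utf-8')).hexdigest()

def register(username, password):
    # The username is part of the hashed input, so it acts as a per-user salt
    db[username] = get_md5(password + username + 'the-Salt')

register('michael', '123456')
register('bob', '123456')

print(db['michael'] == db['bob'])  # False
```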

Implement login validation using the modified MD5 algorithm:

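import hashlib, random

class User(object):
    def __init__(self, username, password):
        self.username = username
        # Random 20-character salt, generated once per user
        self.salt = ''.join([chr(random.randint(48, 122)) for i in range(20)])
        # Store the MD5 of password + salt, never the plaintext password
        self.password = hashlib.md5((password + self.salt).encode('utf-8')).hexdigest()

db = {
    'michael': User('michael', '123456'),
    'bob': User('bob', 'abc999'),
    'alice': User('alice', 'alice2008')
}

def get_md5(user, pws):
    # TODO: return the hex MD5 of pws combined with user.salt,
    # in the same order used by User.__init__
    pass

def login(username, password):
    user = db[username]
    return user.password == get_md5(user, password)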

# Test cases:
assert login('michael', '123456')
assert login('bob', 'abc999')
assert login('alice', 'alice2008')
assert not login('michael', '1234567')
assert not login('bob', '123456')
assert not login('alice', 'Alice2008')
print('ok')
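
One possible solution to Exercise 2 is sketched below, keeping the get_md5(user, pws) signature from the exercise. It assumes the stored hash is the MD5 of the password followed by the user's salt, and that an unknown username should simply fail the login; neither detail is fixed by the exercise itself.

```python
import hashlib
import random

class User(object):
    def __init__(self, username, password):
        self.username = username
        # Random 20-character salt, generated once per user
        self.salt = ''.join(chr(random.randint(48, 122)) for _ in range(20))
        # Stored value: MD5 of password + salt
        self.password = hashlib.md5((password + self.salt).encode('utf-8')).hexdigest()

db = {
    'michael': User('michael', '123456'),
    'bob': User('bob', 'abc999'),
    'alice': User('alice', 'alice2008')
}

def get_md5(user, pws):
    # Hash the candidate password with the salt stored for this user
    return hashlib.md5((pws + user.salt).encode('utf-8')).hexdigest()

def login(username, password):
    user = db.get(username)
    if user is None:
        # Unknown user (assumed behavior): treat as a failed login
        return False
    return user.password == get_md5(user, password)

print(login('michael', '123456'))   # True
print(login('michael', '1234567'))  # False
```

Because each salt is random, the stored hashes change on every run, but login still succeeds as long as the same salt is used for registration and validation.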

Summary

Hash algorithms have widespread applications across many fields. It is important to note that hash algorithms are not encryption algorithms: they cannot be used for encryption (since plaintext cannot be derived from a hash), only for tamper detection. However, their one-way computation property enables password validation without storing plaintext credentials.

Reference Source Code

use_hashlib.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import hashlib

md5 = hashlib.md5()
md5.update("how to use md5 in python hashlib?".encode("utf-8"))
print(md5.hexdigest())

sha1 = hashlib.sha1()
sha1.update("how to use sha1 in ".encode("utf-8"))
sha1.update("python hashlib?".encode("utf-8"))
print(sha1.hexdigest())
