Python’s hashlib module provides common hash algorithms such as MD5 and SHA1.
What is a hash algorithm? Also known as a digest algorithm or hashing algorithm, it uses a function to convert data of arbitrary length into a fixed-length data string (usually represented as a hexadecimal string).
For example, suppose you write an article with the content string 'how to use python hashlib - by Michael', and attach its hash value: '2d73d4f15c0db7f5ecb321b6a65e5d6d'. If someone tampers with your article and publishes it as 'how to use python hashlib - by Bob', you can immediately point out the tampering—because the hash calculated from the altered string will differ from the original.
Clearly, a hash algorithm uses a hash function hash(data) to compute a fixed-length hash digest for arbitrary-length data data, with the goal of detecting whether the original data has been tampered with.
Hash algorithms can detect tampering because hash functions are one-way functions: calculating digest = hash(data) is easy, but reversing the process (deriving data from digest) is extremely difficult. Furthermore, modifying even a single bit of the original data will result in a completely different hash value.
Let’s use the common MD5 hash algorithm to calculate the MD5 value of a string:
import hashlib
md5 = hashlib.md5()
md5.update('how to use md5 in python hashlib?'.encode('utf-8'))
print(md5.hexdigest())
The output is:
d26a53750bc40b38b65a520292f69306
For large datasets, you can call update() in chunks—the final result remains the same:
import hashlib
md5 = hashlib.md5()
md5.update('how to use md5 in '.encode('utf-8'))
md5.update('python hashlib?'.encode('utf-8'))
print(md5.hexdigest())
Try changing a single letter and observe how the result changes completely.
MD5 is the most common hash algorithm: it is fast, and generates a fixed 128-bit/16-byte result (typically represented as a 32-character hexadecimal string).
Another common hash algorithm is SHA1, which is invoked almost identically to MD5:
import hashlib
sha1 = hashlib.sha1()
sha1.update('how to use sha1 in '.encode('utf-8'))
sha1.update('python hashlib?'.encode('utf-8'))
print(sha1.hexdigest())
SHA1 produces a 160-bit/20-byte result (usually a 40-character hexadecimal string).
More secure alternatives to SHA1 are SHA256 and SHA512—however, more secure algorithms are slower and produce longer hash values.
Is it possible for two different sets of data to produce the same hash via a given algorithm? Absolutely. All hash algorithms map an infinite set of data to a finite set of hashes. This scenario is called a collision. For example, Bob could attempt to create an article 'how to learn hashlib in python - by Bob' whose hash matches yours exactly. While not impossible, this is extremely difficult to achieve.
Where are hash algorithms used? A common example:
Any website that allows user login must store usernames and passwords. How should these credentials be stored? One approach is to save them in a database table:
| name | password |
|---|---|
| michael | 123456 |
| bob | abc999 |
| alice | alice2008 |
Storing passwords in plaintext is risky: if the database is compromised, hackers gain access to all user passwords. Additionally, website administrators can access the database and view all plaintext passwords.
The correct approach is to store hashed passwords (e.g., MD5) instead of plaintext:
| username | password |
|---|---|
| michael | e10adc3949ba59abbe56e057f20f883e |
| bob | 878ef96e86145580c38c87f0410ad153 |
| alice | 99b1c2188db85afee403b1536010c2c9 |
When a user logs in:
Write a function to calculate the MD5 hash of a user-input password (for database storage):
def calc_md5(password):
pass
Storing MD5 hashes ensures that even administrators with database access cannot view plaintext passwords.
Design a function to validate user logins (return True/False based on password correctness):
db = {
'michael': 'e10adc3949ba59abbe56e057f20f883e',
'bob': '878ef96e86145580c38c87f0410ad153',
'alice': '99b1c2188db85afee403b1536010c2c9'
}
def login(user, password):
pass
# Test cases:
assert login('michael', '123456')
assert login('bob', 'abc999')
assert login('alice', 'alice2008')
assert not login('michael', '1234567')
assert not login('bob', '123456')
assert not login('alice', 'Alice2008')
print('ok')
Is storing MD5 hashes completely secure? Not necessarily. Suppose you are a hacker who has obtained a database of MD5 hashed passwords—how would you reverse-engineer the plaintext passwords? Brute-force cracking is time-consuming, and real hackers avoid this approach.
Consider this scenario: many users choose simple passwords like 123456, 888888, or password. Hackers can precompute MD5 hashes for these common passwords to create a reverse lookup table:
hash_to_plain = {
'e10adc3949ba59abbe56e057f20f883e': '123456',
'21218cca77804d2ba1922c33e0151105': '888888',
'5f4dcc3b5aa765d61d8327deb882cf99': 'password',
'...': '...'
}
With this table, hackers can match database MD5 hashes to plaintext passwords without cracking—compromising accounts with weak passwords instantly.
Users should avoid simple passwords, but can we strengthen security at the program level?
Since MD5 hashes of common passwords are easy to precompute, we need to ensure stored hashes do not match these precomputed values. This is achieved by appending a complex string to the original password—commonly known as salting:
def calc_md5(password):
return get_md5(password + 'the-Salt')
With salted MD5 hashes, even simple passwords are hard to reverse-engineer—as long as the salt remains secret.
However, if two users use the same simple password (e.g., 123456), their salted MD5 hashes will still be identical (revealing shared passwords). Can we store unique MD5 hashes for users with identical passwords?
Assuming usernames are immutable, we can include the username as part of the salt when calculating MD5—ensuring unique hashes even for identical passwords.
Simulate user registration by calculating a more secure MD5 hash from the username and password:
db = {}
def register(username, password):
db[username] = get_md5(password + username + 'the-Salt')
Implement login validation using the modified MD5 algorithm:
import hashlib, random
class User(object):
def __init__(self, username, password):
self.username = username
self.salt = ''.join([chr(random.randint(48, 122)) for i in range(20)])
self.password = get_md5(password + self.salt)
db = {
'michael': User('michael', '123456'),
'bob': User('bob', 'abc999'),
'alice': User('alice', 'alice2008')
}
def get_md5(user, pws):
return ???
def login(username, password):
user = db[username]
return user.password == get_md5(user, password)
# Test cases:
assert login('michael', '123456')
assert login('bob', 'abc999')
assert login('alice', 'alice2008')
assert not login('michael', '1234567')
assert not login('bob', '123456')
assert not login('alice', 'Alice2008')
print('ok')
Hash algorithms have widespread applications across many fields. It is important to note that hash algorithms are not encryption algorithms: they cannot be used for encryption (since plaintext cannot be derived from a hash), only for tamper detection. However, their one-way computation property enables password validation without storing plaintext credentials.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import hashlib
md5 = hashlib.md5()
md5.update("how to use md5 in python hashlib?".encode("utf-8"))
print(md5.hexdigest())
sha1 = hashlib.sha1()
sha1.update("how to use sha1 in ".encode("utf-8"))
sha1.update("python hashlib?".encode("utf-8"))
print(sha1.hexdigest())