We Used MD5 for File Integrity Checks. Here's Why We Switched to SHA-256

The audit finding

Our quarterly security audit came back with a finding in our document processing microservice: we were using MD5 to generate checksums for uploaded contract files. The auditor's severity rating was "Medium" — not critical, but not ignorable. The finding read: "MD5 is cryptographically broken and should not be used for security-relevant integrity verification. MD5 collisions can be constructed in seconds on consumer hardware."

My first reaction was defensive. We weren't using MD5 for passwords (we'd already switched those to PBKDF2). We were using it to verify that a 40MB contract file arrived intact after an upload to Azure Blob Storage — the kind of "did the file get corrupted in transit?" check. Was this really a security concern? I spent an afternoon understanding the actual risk before deciding whether the migration was justified or just checkbox compliance.

What MD5 actually is and why it's broken

MD5 produces a 128-bit (16-byte, 32 hex-character) digest. It was designed by Ron Rivest in 1991, published as RFC 1321 in 1992, and widely treated as secure through the 1990s. The problem: in 2004, Xiaoyun Wang, Dengguo Feng, and colleagues published a method for finding MD5 collisions (two different inputs that produce the same MD5 output) in about an hour on hardware available at the time. By around 2007, the technique had been refined to the point where collisions could be generated in seconds on a standard PC.

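A collision attack means an adversary can construct two different files that produce the identical MD5 hash. The attacker has to craft both files; finding a second file that matches the hash of an arbitrary existing file (a second-preimage attack) is still not practical against MD5. In our context, a collision was enough: a malicious actor who prepared both a benign and a tampered version of a contract could get the benign one accepted and later substitute the tampered one while keeping the MD5 checksum identical, making our integrity check useless against a targeted attack.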

The auditor demonstrated this using a publicly available collision set. They showed us two different PDF files — one contained our standard contract template, the other contained a modified version — and both produced identical MD5 checksums. The modification was subtle (a few bytes in the PDF metadata), but it was there.

What our original code looked like

// Original integrity check code — MD5
public async Task<string> ComputeChecksumAsync(Stream fileStream)
{
    using var md5 = MD5.Create();
    var hashBytes = await md5.ComputeHashAsync(fileStream);
    return Convert.ToHexString(hashBytes).ToLower(); // 32-char hex string
}

// Called during upload to store the checksum alongside the file metadata
public async Task UploadContractAsync(IFormFile file)
{
    using var stream = file.OpenReadStream();
    var checksum = await ComputeChecksumAsync(stream);

    // Store the checksum in the database alongside the blob reference
    await _context.Contracts.AddAsync(new Contract
    {
        BlobName = GenerateBlobName(file.FileName),
        Checksum = checksum,
        ChecksumAlgorithm = "MD5",
        UploadedAt = DateTimeOffset.UtcNow
    });
    // ... upload to blob storage
}

The migration to SHA-256

// Migrated code — SHA-256 using the one-shot API (most efficient for in-memory data)
// SHA256.HashData() was added in .NET 5 and avoids the Create()/using pattern
public static string ComputeChecksum(ReadOnlySpan<byte> fileBytes)
{
    var hashBytes = SHA256.HashData(fileBytes);
    return Convert.ToHexString(hashBytes).ToLower(); // 64-char hex string
}

// For streaming data (large files that shouldn't be fully buffered)
public async Task<string> ComputeChecksumAsync(Stream fileStream)
{
    using var sha256 = SHA256.Create();
    var hashBytes = await sha256.ComputeHashAsync(fileStream);
    return Convert.ToHexString(hashBytes).ToLower();
}

// During upload: now records the algorithm name explicitly
public async Task UploadContractAsync(IFormFile file)
{
    await using var stream = file.OpenReadStream();

    // First pass: hash the whole stream (the helper disposes its SHA256 instance)
    var checksum = await ComputeChecksumAsync(stream);

    stream.Position = 0; // Rewind so the upload starts from the first byte

    // Compute the blob name once so the database record and the blob agree
    var blobName = GenerateBlobName(file.FileName);

    await _context.Contracts.AddAsync(new Contract
    {
        BlobName = blobName,
        Checksum = checksum,
        ChecksumAlgorithm = "SHA-256",  // explicit, so we can migrate again later
        UploadedAt = DateTimeOffset.UtcNow
    });
    await _blobContainerClient.GetBlobClient(blobName).UploadAsync(stream);
    await _context.SaveChangesAsync();
}
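
The upload method reads the stream twice and depends on it being seekable (stream.Position = 0). Where that can't be assumed, one option is to hash the bytes as they flow to their destination. The following is a minimal sketch of that idea, not the code we shipped: CopyAndHashAsync is a hypothetical helper, and a MemoryStream stands in for the upload destination. How this plugs into a particular storage SDK is left as an assumption.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper: copies 'source' to 'destination' in one pass and
// returns the SHA-256 of the bytes that were copied, as lowercase hex.
static async Task<string> CopyAndHashAsync(Stream source, Stream destination)
{
    using var sha256 = SHA256.Create();

    // In Read mode, CryptoStream pushes every byte it reads through the transform.
    // A HashAlgorithm used as a transform passes data through unchanged while
    // accumulating the hash.
    await using (var hashing = new CryptoStream(source, sha256, CryptoStreamMode.Read, leaveOpen: true))
    {
        await hashing.CopyToAsync(destination);

        // The hash is finalized once the end of the source stream has been read.
        byte[] hash = sha256.Hash ?? throw new InvalidOperationException("Hash was not finalized.");
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}

// Usage with in-memory stand-ins for the uploaded file and the storage target.
byte[] payload = Encoding.UTF8.GetBytes("example contract bytes");
using var upload = new MemoryStream(payload);
using var stored = new MemoryStream();

string checksum = await CopyAndHashAsync(upload, stored);
Console.WriteLine(checksum);
```

The trade-off is that the checksum is only known after the copy finishes, so the database record has to be written after the upload rather than before it.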

The migration required two additional changes beyond the hash function itself:

Database migration: The checksum column was a CHAR(32) (MD5 hex length). SHA-256 produces 64 hex characters, so we widened the column to CHAR(64). NVARCHAR(64) would also work, but hex digits are plain ASCII and don't need Unicode storage.
Re-checksumming existing files: Files already in storage had MD5 checksums stored. We wrote a background job that re-downloaded each blob from Azure Storage, computed the SHA-256, and updated the database record. For our ~80,000 contracts, this took about 4 hours.
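
A re-checksumming job like this can compute both digests in a single read of each blob. The sketch below assumes it is worth confirming that the stored MD5 still matches before trusting the new SHA-256, since re-hashing only certifies the bytes as they exist today. HashBothAsync is a hypothetical helper, and a MemoryStream stands in for the downloaded blob.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper: reads the stream once and returns both digests as lowercase hex.
static async Task<(string Md5, string Sha256)> HashBothAsync(Stream input)
{
    using var md5 = IncrementalHash.CreateHash(HashAlgorithmName.MD5);
    using var sha256 = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);

    var buffer = new byte[81920];
    int read;
    while ((read = await input.ReadAsync(buffer.AsMemory())) > 0)
    {
        // Feed the same chunk to both hashes so the blob is only read once.
        md5.AppendData(buffer, 0, read);
        sha256.AppendData(buffer, 0, read);
    }

    return (Convert.ToHexString(md5.GetHashAndReset()).ToLowerInvariant(),
            Convert.ToHexString(sha256.GetHashAndReset()).ToLowerInvariant());
}

// Stand-ins for one database record and its blob contents.
byte[] contents = Encoding.UTF8.GetBytes("legacy contract bytes");
string storedMd5 = Convert.ToHexString(MD5.HashData(contents)).ToLowerInvariant();

using var blobStream = new MemoryStream(contents);
var (md5Hex, sha256Hex) = await HashBothAsync(blobStream);

if (string.Equals(md5Hex, storedMd5, StringComparison.OrdinalIgnoreCase))
    Console.WriteLine($"MD5 still matches; new SHA-256: {sha256Hex}");
else
    Console.WriteLine("Stored MD5 no longer matches; flag the record for review.");
```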

When MD5 is still acceptable

The audit finding was correct for our use case, but MD5 is not universally banned. It remains appropriate where collision resistance is not required:

HTTP ETag headers: Generating an ETag as the MD5 of a response body is fine. An adversary cannot inject a tampered response with the same ETag through your own server, so collision resistance is irrelevant here.
Cache keys: Using MD5 to hash a cache key string into a fixed-length identifier for Redis or Memcached is acceptable when you control the inputs; no adversary can construct a collision against a key you generate internally. If part of the key comes from user input, treat it like any other externally influenced data.
Content deduplication: Using MD5 to detect duplicate uploads when you trust the source (e.g., an internal batch job) is fine. The risk is only present when an external party can influence what file is hashed.

// MD5 is still fine for cache key hashing (non-security context)
public static string GetCacheKey(string input)
{
    var bytes = Encoding.UTF8.GetBytes(input);
    var hash = MD5.HashData(bytes); // .NET 5+ one-shot API
    return Convert.ToHexString(hash); // 32-char hex — compact cache key
}

// SHA-256 for anything involving external data or security guarantees
public static string GetFileIntegrityHash(byte[] fileBytes)
{
    var hash = SHA256.HashData(fileBytes);
    return Convert.ToHexString(hash); // 64-char hex
}

Why the algorithm name matters for future migrations

One decision I'm glad we made during the migration: storing the algorithm name (ChecksumAlgorithm = "SHA-256") alongside the checksum value in the database. MD5 is broken, but SHA-256 won't be unbreakable forever. Storing the algorithm name makes a future migration to SHA-3 or BLAKE3 trivially queryable — you can find all records still using the old algorithm with a single WHERE ChecksumAlgorithm = 'SHA-256' query and re-process them.

Without the algorithm name, you'd have to infer it from the checksum length (32 chars → MD5, 40 → SHA-1, 64 → SHA-256), which works until someone stores a truncated hash or Base64-encodes it instead of hex-encoding it. Name the algorithm. You'll thank yourself in three years.
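
Storing the name also keeps verification simple while records with different algorithms coexist. The following is a rough sketch of dispatching on the stored label; ComputeHash and VerifyChecksum are hypothetical helpers, and the labels mirror the ChecksumAlgorithm values used earlier ("MD5", "SHA-256").

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Maps the stored algorithm label to the matching one-shot hash API (.NET 5+).
static byte[] ComputeHash(string algorithm, ReadOnlySpan<byte> data) => algorithm switch
{
    "MD5" => MD5.HashData(data),
    "SHA-256" => SHA256.HashData(data),
    _ => throw new NotSupportedException($"Unknown checksum algorithm: {algorithm}")
};

// Recomputes the digest named by the record and compares it to the stored hex value.
static bool VerifyChecksum(string algorithm, string expectedHex, ReadOnlySpan<byte> data)
{
    byte[] expected = Convert.FromHexString(expectedHex); // accepts upper or lower case hex
    byte[] actual = ComputeHash(algorithm, data);

    // Returns false when the lengths differ, e.g. an MD5 label paired with a SHA-256 value.
    return CryptographicOperations.FixedTimeEquals(expected, actual);
}

// Usage with stand-in file contents and a stored checksum.
byte[] file = Encoding.UTF8.GetBytes("contract bytes");
string storedHex = Convert.ToHexString(SHA256.HashData(file)).ToLowerInvariant();

bool intact = VerifyChecksum("SHA-256", storedHex, file);
Console.WriteLine(intact);
```

An unknown label throws instead of silently passing, so a record written by a newer algorithm can't be mistaken for a verified one.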

The generalizable lesson

The question "is MD5 safe for this?" has a nuanced answer. If you control all inputs and no adversary can influence what gets hashed, MD5's collision vulnerability is irrelevant — use it freely for cache keys, ETags, and internal checksums where performance matters. If an external party provides the data being hashed, and a collision could let them substitute one piece of data for another while keeping the hash identical, MD5 is unsafe and SHA-256 is the minimum acceptable standard.

In our case, users uploaded the contract files, and we used the checksum to verify integrity. An adversary who prepared two versions of a contract with the same MD5, one legitimate-looking and one malicious, could theoretically get the first accepted and later swap in the second without the checksum changing. That's the threat model. For that threat model, MD5 is broken. For the ETag threat model (where you generate the hash from your own response), it's fine.

Use the Hash Generator to compare the output format and length of MD5 vs SHA-256 for the same input, and to generate test vectors for your own implementation.
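
The same comparison can be run locally. This small sketch hashes the ASCII string "abc", a common test-vector input, with both algorithms using the .NET 5+ one-shot APIs and prints the digests alongside their hex lengths; the variable names are illustrative only.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hash the same input with both algorithms and compare digest sizes.
byte[] input = Encoding.ASCII.GetBytes("abc");

byte[] md5 = MD5.HashData(input);       // 16 bytes (128 bits)
byte[] sha256 = SHA256.HashData(input); // 32 bytes (256 bits)

string md5Hex = Convert.ToHexString(md5).ToLowerInvariant();
string sha256Hex = Convert.ToHexString(sha256).ToLowerInvariant();

Console.WriteLine($"MD5     ({md5Hex.Length} hex chars): {md5Hex}");
Console.WriteLine($"SHA-256 ({sha256Hex.Length} hex chars): {sha256Hex}");
```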
