TL;DR: I have an CMS system that stores attachments (opaque files) using SHA-1 of the file contents as the filename. How to verify if uploaded file really matches one in the storage, given that I already know that SHA-1 hash matches for both files? I'd like to have high performance.
Long version:
When an user uploads a new file to the system, I compute SHA-1 hash of the uploaded file contents and then check if a file with identical hash already exists in the storage backend. PHP puts the uploaded file in /tmp
before my code gets to run and then I run sha1sum
against the uploaded file to get SHA-1 hash of the file contents. I then compute fanout from the computed SHA-1 hash and decide storage directory under NFS mounted directory hierarchy. (For example, if the SHA-1 hash for a file contents is 37aefc1e145992f2cc16fabadcfe23eede5fb094
the permanent file name is /nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094
.) In addition to saving the actual file contents, I INSERT
a new line into a SQL database for the user submitted meta data (e.g. Content-Type
, original filename, datestamp, etc).
The corner case I'm currently figuring out is the case where a new uploaded file has SHA-1 hash that matches existing hash in the storage backend. I know that the changes for this happening by accident are astronomically low, but I'd like to be sure. (For on purpose case, see https://shattered.io/)
Given two filenames $file_a
and $file_b
, how to quickly check if both files have identical contents? Assume that files are too big to be loaded into memory. With Python, I'd use filecmp.cmp()
but PHP does not seem to have anything similar. I know that this can be done with fread()
and aborting if a non-matching byte is found, but I'd rather not write that code.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…