Hi all,
I need to compare one file against many (say, 100 at most). Even if their names differ, if the data inside them (specifically the cells, the number of rows, perhaps some formatting too) is exactly the same,
I must raise a flag. If the names are exactly the same, I think an MD5 hash will probably work fine. Refusing to let two identically named files into the import process is not an option, because it is quite likely that files will share a name. Forcing the user to rename every time is just not a desirable approach.
The files may be in one of three formats: .xls, .xlsx, or .csv. They contain 100 to 5000 rows and up to 50 columns, all numeric data.
If I need to make things easier, I can limit the scope of this project to .csv only.
To be clearer:
The user will be importing a file for manipulation. If the file is the same as an already imported source, the user should be told before proceeding.
Other thoughts:
Maybe I am overthinking this, but...
Option 1: Hash
I would prefer to have some sort of hash string, because ultimately I think that's the fastest way: I can just generate the hash for the source and compare it against a hash table of the existing files.
But I don't think this is possible with any hash algorithm.
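To show what I mean by a hash, here is a rough Python sketch for the .csv-only case. It hashes the parsed cell values instead of the raw bytes, so the file name and the line endings should not matter. The names content_hash and normalize_cell and the file name january.csv are made up for illustration, SHA-256 stands in for MD5, and treating "1" and "1.0" as the same cell is an assumption on my part. It ignores formatting entirely.

```python
import csv
import hashlib


def normalize_cell(text):
    """Return a canonical string for one cell (hypothetical helper)."""
    text = text.strip()
    try:
        # Assumption: numeric cells that parse to the same float are equal,
        # so "1", "1.0" and " 1 " all normalize to "1.0".
        return repr(float(text))
    except ValueError:
        # Non-numeric cells (e.g. a header row) are compared as stripped text.
        return text


def content_hash(lines):
    """Hash the cell contents of CSV data given as an iterable of text lines."""
    digest = hashlib.sha256()
    for row in csv.reader(lines):
        cells = [normalize_cell(cell) for cell in row]
        # Separator characters keep the row ["1", "23"] distinct from ["12", "3"].
        digest.update("\x1f".join(cells).encode("utf-8"))
        digest.update(b"\x1e")
    return digest.hexdigest()


# Hash table of files already imported: content hash -> file name.
imported = {content_hash(["1,2.0\r\n", "3,4\r\n"]): "january.csv"}

# Same cells, but written differently and with different line endings.
new_hash = content_hash(["1.0,2\n", "3,4.00\n"])
duplicate_of = imported.get(new_hash)  # "january.csv"
```

For a real file, the lines would come from something like open(path, newline=""), and checking a new import would be one dictionary lookup per file.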
Option 2: Cell-by-cell comparison
I could write a function that compares the files cell by cell, but that seems cumbersome and possibly even painfully slow.
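For the .csv-only case, the function I am picturing would be roughly the Python sketch below. The names same_cells and cell_key are hypothetical, and comparing numeric cells by their parsed value (so "1" equals "1.0") is my assumption, not a requirement. It stops at the first difference it finds.

```python
import csv
from itertools import zip_longest


def cell_key(text):
    """Return a comparable form of one cell (hypothetical helper)."""
    text = text.strip()
    try:
        # Assumption: cells that parse to the same float count as equal.
        return repr(float(text))
    except ValueError:
        return text


def same_cells(lines_a, lines_b):
    """Compare two CSV sources, each given as an iterable of text lines."""
    rows_a = csv.reader(lines_a)
    rows_b = csv.reader(lines_b)
    for row_a, row_b in zip_longest(rows_a, rows_b):
        if row_a is None or row_b is None:
            return False  # one source has more rows than the other
        if len(row_a) != len(row_b):
            return False  # different number of columns in this row
        for a, b in zip(row_a, row_b):
            if cell_key(a) != cell_key(b):
                return False  # first differing cell ends the comparison
    return True


identical = same_cells(["1,2\n", "3,4\n"], ["1.0,2\n", "3,4.0\n"])
different = same_cells(["1,2\n", "3,4\n"], ["1,2\n", "3,5\n"])
```

At 5000 rows by 50 columns, the worst case is 250,000 cell comparisons per pair of files, or 25 million against 100 existing files, and that worst case only happens when the files match all the way through.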
Maybe this is unrealistic. I'm curious to hear your thoughts.
Thanks