



Verifying that two files are the same


49 replies to this topic

#21 awsmazinggenius

  • Members
  • 930 posts
  • LocationCanada

Posted 04 May 2014 - 02:19 PM

Yeah, that just won't cut it, as CC sucks for editing large Lua files, or programs that you want to be polished enough to release.

I've found half of how Git hashes multiple files; here is how Git finds the hash for a single file. It isn't exactly sha1(handle.readAll()).
Edit: forgot to link to it, but it is basically this:
sha1("blob "..filesize.."\0"..filecontent)
--# where \0 is a null byte. I'm not sure how you do null bytes in Lua. 
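
For anyone following along, here is a minimal sketch of that formula in plain Lua. The function names are made up for illustration, and sha1 is assumed to be whatever string-to-hex-digest function you already have; it is passed in as a parameter rather than provided here.

```lua
-- Build the byte string that Git hashes for a single file (a "blob" object):
-- the word "blob", a space, the content length in bytes, one null byte,
-- then the raw file content.
local function gitBlobPayload(content)
  return "blob " .. #content .. "\0" .. content
end

-- `sha1` is a hypothetical function (string -> hex digest) supplied by the caller.
local function gitBlobHash(content, sha1)
  return sha1(gitBlobPayload(content))
end

local payload = gitBlobPayload("hello")
print(#payload) --# 12: "blob 5" (6 bytes) + null byte (1) + "hello" (5)
```

The length in the header has to be the byte count of exactly the content that gets hashed, so the file should be read without any newline conversion.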

Edited by awsmazinggenius, 04 May 2014 - 05:45 PM.


#22 skwerlman

  • Members
  • 163 posts
  • LocationPennsylvania

Posted 05 May 2014 - 05:11 PM

awsmazinggenius, on 04 May 2014 - 02:19 PM, said:

Yeah, that just won't cut it, as CC sucks for editing large Lua files, or programs that you want to be polished enough to release.

I've found half of how Git hashes multiple files, here is how Git finds the hash for a single file. It isn't exactly sha1(handle.readAll()).
Edit: forgot to link to it, but it is basically this:
sha1("blob "..filesize.."\0"..filecontent)
--# where \0 is a null byte. I'm not sure how you do null bytes in Lua.
I'm pretty sure Lua supports (some? most?) standard C escape sequences, including \nnn (in decimal) for ASCII chars (including null bytes).
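
A quick way to check this, as a small sketch in plain Lua 5.1 syntax (the variable names are only for illustration):

```lua
-- "\0" is a decimal escape for byte 0; a Lua string can hold it like any other byte.
local withNull = "a\0b"
-- Decimal escapes take up to three digits: \65 is "A" and \066 is "B".
local letters = "\65\066"

print(#withNull)                 --# 3, the null byte counts toward the length
print(string.byte(withNull, 2))  --# 0
print(letters)                   --# AB
```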

#23 blipman17

  • Members
  • 92 posts

Posted 05 May 2014 - 05:42 PM

What about a file that stores exactly how many times every single character is used in your code, and comparing your current code against it? As far as I know it would be a bit faster than hashing, although two different files having equal counts of all characters is more likely than a collision with a good hashing algorithm.

#24 skwerlman

  • Members
  • 163 posts
  • LocationPennsylvania

Posted 05 May 2014 - 05:50 PM

blipman17, on 05 May 2014 - 05:42 PM, said:

What about a file that stores exactly how many times every single character is used in your code, and comparing your current code against it? As far as I know it would be a bit faster than hashing, although two different files having equal counts of all characters is more likely than a collision with a good hashing algorithm.
If someone replaces a line with one of the same length, no difference would be detected.

EDIT: A simple, temporary fix would be to ask whether the user would like to report the error. That way, if someone generates 20 errors in 7 min, they'd have to confirm all 20 error reports. Obviously this won't stop someone from being malicious, but it should help prevent unintentional report spam.

Edited by skwerlman, 05 May 2014 - 06:01 PM.


#25 MKlegoman357

  • Members
  • 1,170 posts
  • LocationKaunas, Lithuania

Posted 05 May 2014 - 05:58 PM

skwerlman, on 05 May 2014 - 05:50 PM, said:

If someone replaces a line with one of the same length, no difference would be detected.

The idea is not to compare the file size, but how many times each character appears. But it wouldn't detect any difference if the characters were only shuffled around. For example:

local id = os.getComputerID()

--// Change that to:

local os.getComputerID = id() --// This one has the same amount of every letter the above one has

It's not likely that someone would change the code to be different but have the same amount of every letter the original code has, but it is still possible.
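
To make the point concrete, here is a rough sketch of the character-tally check in plain Lua (charCounts and sameCounts are names I made up). It reports the two lines above as "the same" even though they are different strings.

```lua
-- Tally how many times each character occurs in a string.
local function charCounts(text)
  local counts = {}
  for ch in text:gmatch(".") do
    counts[ch] = (counts[ch] or 0) + 1
  end
  return counts
end

-- Two strings "match" if every character occurs equally often in both.
local function sameCounts(a, b)
  local ca, cb = charCounts(a), charCounts(b)
  for ch, n in pairs(ca) do
    if cb[ch] ~= n then return false end
  end
  for ch, n in pairs(cb) do
    if ca[ch] ~= n then return false end
  end
  return true
end

local original = "local id = os.getComputerID()"
local shuffled = "local os.getComputerID = id()"
print(original == shuffled)           --# false
print(sameCounts(original, shuffled)) --# true, so the change goes unnoticed
```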

Edited by MKlegoman357, 05 May 2014 - 06:00 PM.


#26 skwerlman

  • Members
  • 163 posts
  • LocationPennsylvania

Posted 05 May 2014 - 06:05 PM

MKlegoman357, on 05 May 2014 - 05:58 PM, said:

skwerlman, on 05 May 2014 - 05:50 PM, said:

If someone replaces a line with one of the same length, no difference would be detected.

The idea is not to compare the file size, but how many times each character appears. But it wouldn't detect any difference if the characters were only shuffled around. For example:

local id = os.getComputerID()

--// Change that to:

local os.getComputerID = id() --// This one has the same amount of every letter the above one has

It's not likely that someone would change the code to be different but have the same amount of every letter the original code has, but it is still possible.
Oh, I misread that. That's certainly better than checking file size, but it sounds fairly slow, since each char is checked and tallied individually. Remember, OneOS is huge, so we need a relatively fast algorithm.

#27 awsmazinggenius

  • Members
  • 930 posts
  • LocationCanada

Posted 05 May 2014 - 10:01 PM

Depending on the speed of your web server, it might actually be smart to send the files off somewhere. If the OS crashes, you can compress the code and send it off to the web where PHP calculates the md5 and checks it. You would need to have an algorithm to compress the code as best as you can, though, and then you'd need to reimplement it in PHP to decompress the code.

EDIT: The reason I say this is that calculating the SHA1 of the latest commit on GitHub won't work, because Git also includes the previous commit's hash in that hash. You could still hash all the files, though (since we aren't required to use SHA1 here, you would probably want to use Grav's SHA256 snippet, as I would think it has fewer collisions), concatenate the hashes and hash again, then send this hash off to the web to check against the one you've calculated for the latest version of OneOS. Again, it's just a matter of picking and choosing what to hash, and remembering to recalculate for each release.

Edited by awsmazinggenius, 05 May 2014 - 10:05 PM.


#28 oeed

    Oversimplifier

  • Members
  • 2,095 posts
  • LocationAuckland, New Zealand

Posted 05 May 2014 - 10:04 PM

awsmazinggenius, on 05 May 2014 - 10:01 PM, said:

Depending on the wood of your web server, it might actually be smart to send the files off somewhere. If the OS crashes, you can compress the code and send it off to the web where PHP calculates the md5 and checks it. You would need to have an algorithm to compress the code as best as you can, though, and then you'd need to reimplement it in PHP to decompress the code.
Uploading 1MB on some connections (i.e. every single one in Australia) would take ages. I'll just SHA1 them and compare it to GitHub.

#29 awsmazinggenius

  • Members
  • 930 posts
  • LocationCanada

Posted 05 May 2014 - 10:12 PM

Looking at what you quoted, you haven't seen my edit, as I also fixed an obvious spelling mistake in that edit. Also, I forgot about the varying-internet-speeds problem. Something makes me wonder how you play online games...

#30 oeed

    Oversimplifier

  • Members
  • 2,095 posts
  • LocationAuckland, New Zealand

Posted 05 May 2014 - 10:23 PM

awsmazinggenius, on 05 May 2014 - 10:12 PM, said:

Looking at what you quoted, you haven't seen my edit, as I also fixed an obvious spelling mistake in that edit. Also, I forgot about the varying-internet-speeds problem. Something makes me wonder how you play online games...
Hmm I see. I might just make a hash each release and compare it to that.

I often wonder that too, as do many of the people on Cranium's server. I don't even want to mention what it's like when my brother plays GuildWars 2.....

#31 awsmazinggenius

  • Members
  • 930 posts
  • LocationCanada

Posted 05 May 2014 - 10:29 PM

Yes, that is what will need to happen, as a Git SHA1 is apparently not just a hash of the files. Just SHA256 (using Grav's snippet) all the files, concatenate the hashes in the same order each time (maybe sorted "alphabetically" over 0-9 a-f; I don't know the word, and all that matters is that the order is the same each time), hash those concatenated hashes, and send 'em off to your server where you already handle reporting.
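
As a rough sketch of the "same order each time" part in plain Lua: combineHashes is a name I made up, and hashFn stands in for whatever hash function you use (Grav's SHA256 snippet or anything else), passed in as a parameter.

```lua
-- Combine per-file hex digests into one digest that does not depend on
-- the order in which the files were read.
-- `hashes` is a list of hex strings; `hashFn` is a hypothetical
-- string -> digest function supplied by the caller.
local function combineHashes(hashes, hashFn)
  local sorted = {}
  for i, h in ipairs(hashes) do
    sorted[i] = h -- copy so the caller's table is left untouched
  end
  table.sort(sorted) -- plain string comparison puts 0-9 before a-f
  return hashFn(table.concat(sorted))
end
```

One thing to keep in mind: sorting the hashes themselves throws away which file each hash belongs to, so swapping the contents of two files would go unnoticed. Sorting by file name and concatenating in that order avoids that.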

#32 theoriginalbit

    Semi-Professional ComputerCrafter

  • Moderators
  • 7,332 posts
  • LocationAustralia

Posted 05 May 2014 - 11:45 PM

awsmazinggenius, on 04 May 2014 - 02:19 PM, said:

sha1("blob "..filesize.."\0"..filecontent)
--# where \0 is a null byte. I'm not sure how you do null bytes in Lua.
yes you're correct with the \0

#33 awsmazinggenius

  • Members
  • 930 posts
  • LocationCanada

Posted 06 May 2014 - 03:14 AM

Essentially this pseudo-code:
(Sorry for mistakes, written on iPad)
local hashes = {}
for _, filename in ipairs(filenames) do --# ipairs keeps the order the same each time
  local h = fs.open(filename, "r")
  hashes[#hashes + 1] = sha256(h.readAll())
  h.close() --# release the file handle
end
local finalhash = sha256(table.concat(hashes))

Edited by awsmazinggenius, 06 May 2014 - 11:56 PM.


#34 MKlegoman357

  • Members
  • 1,170 posts
  • LocationKaunas, Lithuania

Posted 06 May 2014 - 11:45 AM

The problem I see with hashing it with sha256 is that there would be over 30 hash calculations (IIRC). Those files are big too. Wouldn't that be quite slow?

#35 awsmazinggenius

  • Members
  • 930 posts
  • LocationCanada

Posted 06 May 2014 - 11:58 PM

If you have a decent computer, no. And SHA256 has fewer collisions than SHA1, so why not?

#36 theoriginalbit

    Semi-Professional ComputerCrafter

  • Moderators
  • 7,332 posts
  • LocationAustralia

Posted 07 May 2014 - 12:01 AM

awsmazinggenius, on 06 May 2014 - 11:58 PM, said:

If you have a decent computer, no. And SHA256 has fewer collisions than SHA1, so why not?
Industry practice: file integrity is checked with CRC32, MD5, or SHA1, because these are enough; anything more is just overkill.

Edited by theoriginalbit, 07 May 2014 - 12:02 AM.


#37 skwerlman

  • Members
  • 163 posts
  • LocationPennsylvania

Posted 07 May 2014 - 03:15 AM

The fastest pure-lua SHA1 implementation I've found is here.
The only 5.1 CRC32 implementation that appears to be CC-compatible (that I could find) is here. You'll need to comment out the first line of actual code, though. (module('CRC32', package.seeall))
Finally, the only pure-lua implementation of MD5 written for 5.1 (again, that I could find) is here.

All three appear to be released under the MIT license.

I hope one of these works well enough for this application.

EDIT: I forgot to mention that I haven't had time to actually test them in CC.

Edited by skwerlman, 07 May 2014 - 03:31 AM.


#38 oeed

    Oversimplifier

  • Members
  • 2,095 posts
  • LocationAuckland, New Zealand

Posted 07 May 2014 - 05:16 AM

On the performance aspect of this, it's worth noting that GravityScore's version actually runs quicker than the SHA1 implementations he's tried.

#39 theoriginalbit

    Semi-Professional ComputerCrafter

  • Moderators
  • 7,332 posts
  • LocationAustralia

Posted 07 May 2014 - 06:02 AM

Honestly, oeed, I think your best method is to just compute CRCs for each version. Have a folder in the repo for the CRCs, with one file per system version number, whose contents are the full path of each file and its CRC. Download that CRC file and compare against the system. That was the easiest and quickest solution that NeverCast and I could find to implement for CCTube.
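
A minimal sketch of the comparison step in plain Lua, under some assumptions of mine: one "path crc" pair per line, the CRC written in hex, and paths without spaces. The function names and file paths are made up for illustration, and actually computing the CRCs is left to whichever CRC32 implementation you pick.

```lua
-- Parse a manifest with one "<full path> <hex CRC>" pair per line
-- (assumed format) into a table mapping path -> lowercase hex CRC.
local function parseManifest(text)
  local manifest = {}
  for line in text:gmatch("[^\r\n]+") do
    local path, crc = line:match("^(%S+)%s+(%x+)$")
    if path then
      manifest[path] = crc:lower()
    end
  end
  return manifest
end

-- Return a sorted list of paths whose CRC is missing or different.
-- `actual` maps path -> lowercase hex CRC computed on the local system.
local function findMismatches(manifest, actual)
  local bad = {}
  for path, crc in pairs(manifest) do
    if actual[path] ~= crc then
      bad[#bad + 1] = path
    end
  end
  table.sort(bad)
  return bad
end

local manifest = parseManifest("startup 1A2B3C4D\nSystem/main.lua deadbeef\n")
local changed = findMismatches(manifest, {
  ["startup"] = "1a2b3c4d",
  ["System/main.lua"] = "00000000",
})
print(table.concat(changed, ", ")) --# System/main.lua
```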

skwerlman, on 07 May 2014 - 03:15 AM, said:

The fastest pure-lua SHA1 implementation I've found is here.
That is extremely slow! This is the fastest SHA1 implementation I've found, and it's made by someone on these forums; it's a near-instant calculation compared to the one you've linked. I've also performed some cleanup on the code, which can be found here.

skwerlman, on 07 May 2014 - 03:15 AM, said:

The only 5.1 CRC32 implementation that appears to be CC-compatible (that I could find) is here. You'll need to comment out the first line of actual code, though. (module('CRC32', package.seeall))
Another cleanup to get it working nicely in ComputerCraft found here ;)

Edited by theoriginalbit, 07 May 2014 - 06:25 AM.


#40 skwerlman

  • Members
  • 163 posts
  • LocationPennsylvania

Posted 07 May 2014 - 05:53 PM

theoriginalbit, on 07 May 2014 - 06:02 AM, said:

Honestly, oeed, I think your best method is to just compute CRCs for each version. Have a folder in the repo for the CRCs, with one file per system version number, whose contents are the full path of each file and its CRC. Download that CRC file and compare against the system. That was the easiest and quickest solution that NeverCast and I could find to implement for CCTube.

skwerlman, on 07 May 2014 - 03:15 AM, said:

The fastest pure-lua SHA1 implementation I've found is here.
That is extremely slow! This is the fastest SHA1 implementation I've found, and it's made by someone on these forums; it's a near-instant calculation compared to the one you've linked. I've also performed some cleanup on the code, which can be found here.

skwerlman, on 07 May 2014 - 03:15 AM, said:

The only 5.1 CRC32 implementation that appears to be CC-compatible (that I could find) is here. You'll need to comment out the first line of actual code, though. (module('CRC32', package.seeall))
Another cleanup to get it working nicely in ComputerCraft found here ;)
Wow, that SHA1 routine is stupid fast compared to the ones I've seen! Nice find!

Doesn't removing the license info constitute a violation of the license?

MIT License said:

--The above copyright notice and this permission notice shall be included in all
--copies or substantial portions of the Software.





