- Timestamp:
- 2000-07-13T10:21:53+12:00 (24 years ago)
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
branches/New_Config_Format-branch/gsdl/src/hashfile/hashfile.cpp
r915 r1279 143 143 144 144 145 146 // create a hash string from the contents of a file. 147 // 148 // The file is treated as a large base 256 number (each char is a digit), 149 // and the hash value is the remainder when this number is divided by a 150 // very large prime. 151 // 152 // PROBLEM: This is a flawed hash function because the rightmost (highest) 153 // value in remainder is more likely to be a "1" than it is to be any other 154 // number. 155 // 156 // EVIDENCE: About 50% of the files in any GSDL directory have a hash code 157 // that starts with 01* which implies (based on the my_convert_num function) 158 // that the rightmost "digit" of the remainder = z such that (z % 16 = 0) 159 // and (z / 16 = 1) which means (z = 1). 160 // 161 // MAJOR REASON: suppose our prime number was 19. Then, if we have a 162 // reasonably random distribution of numbers for which we are going to 163 // calculate (N % 19), we expect to get a roughly uniform distribution 164 // of remainders where the possible values are 0, 1, 2, 3... 18. The 165 // problem is that 9 out of the 19 possible values (10,11... 18) start 166 // with the digit 1. Thus our hascode will start with "01". 167 // 168 // ANOTHER PROBLEM: Characters in the file are read one at a time; after each 169 // one is read it is prepended to the remainder, then the remainder is 170 // recalculated on the string thus far seen. I am sure the math here 171 // is wrong - if I try calculating (111 mod 7) by the same algorithm, it 172 // simply does nor work. 173 // 174 // ANOTHER POSSIBLE PROLEM: Each character from the file is read into 175 // remainder at the most significant end, and when that character 176 // is a zero you get a number like "01" which would be considered 177 // larger than a number like "8" because it is longer (two digits 178 // instead of 1). 179 // 180 // These comments added by Gordon Paynter ([email protected]) in 181 // June 2000. I didn't write any code, however. 182 145 183 char *hashfile (char *filename) { 146 184 FILE *infile = (FILE *)NULL; 147 185 int i; 148 186 149 // calculate the 8 multiples of the prime number to use150 // in the long division151 187 number primepow[8]; 152 188 number pow; … … 163 199 pow.len = 12; 164 200 201 // calculate 8 multiples of the prime number. 202 // These are used to find the remainder using only subtraction operations 165 203 for (i=0; i<8; i++) { 166 204 primepow[i] = pow; … … 168 206 } 169 207 208 // The "remainder" after division by the prime. Our result. 209 remainder.len = 0; 210 211 // open the file 170 212 infile = fopen (filename, "rb"); 171 213 if (infile == NULL) { 172 214 return (char *)NULL; 173 215 } 216 c = (unsigned char)fgetc(infile); 174 217 175 remainder.len = 0;176 c = (unsigned char)fgetc(infile);177 218 while (!feof (infile)) { 178 // remainder = remainder * 256 + c 219 220 // make sure the remainder has not grown too large 179 221 if (remainder.len == MAXNUMLEN-1) { 180 222 fprintf (stderr, "ERROR - number overflow\n"); 181 223 return (char *)NULL; 182 224 } 225 226 // remainder = remainder * 256 + c 183 227 for (i=remainder.len; i>0; i--) { 184 228 remainder.num[i] = remainder.num[i-1]; 185 229 } 186 230 remainder.num[0] = c; 187 231 if (remainder.len > 0 || c != 0) remainder.len = remainder.len+1; 188 232 233 // remainder = (remainder % large-prime-number) 189 234 for (i=7; i>=0; i--) { 190 235 my_ifpos_dec (remainder, primepow[i]); 191 236 } 192 237 238 // read a new character from the file 193 239 c = (unsigned char)fgetc(infile); 194 240 }
Note:
See TracChangeset
for help on using the changeset viewer.