Table of Contents
Introduction to HashingString Hashing
One-Way Hashing
Reverse Hashing
Reverse Hashing Algorithm
Actor File Hashes
What Next?
The Epiphany
Introduction to Hashing
A hash function is an algorithm that maps large data sets (known as keys) into hashes. In Diablo 3, the keys are strings which is why this section is titled "String Hashing". Hashes (also called hash values) in Diablo 3 are stored as unsigned 32-bit integers, and can be represented in decimal (base 10) as a number between 0 and 4294967295, or, more commonly, represented in hexadecimal (base 16) as a number between 0x0 and 0xFFFFFFFF.
A hash function should be referentially transparent; i.e., if the hash function is called twice on input that is "equal" (e.g., strings that consist of the same sequence of characters), it will give the same result.
Some hash functions may map two or more keys to the same hash value, causing a collision. Keys is Diablo 3 never collide. Consequently, topics such hash buckets are beyond the scope of this article.
Diablo 3 makes extensive use of string hashes as a way to correlate data. For example, items are correlated to their type using the "item type hash"; affix names as they appear in-game are correlated to the affix using the "affix string hash". Without understanding hashes, it is impossible to understand how many client game files are related.
String Hashing
Many of the "id" values used through the client game files are actually hashes of string values (typically one of the non-NLS string keys). The hashes are computed as follows:
function DWord d3_hash (string stringKey) {
DWord hash = 0;
char c;
foreach (c in stringKey) {
hash = (hash * 0x21) + ord( c.toLowerCase() );
}
return hash;
}
You can use this (pseudo-code) algorithm to compute Diablo 3 string hashes. Note that a "DWord" is an unsigned 32-bit integer. The term "DWord" means "double word", where a "word" is 2 "bytes", and a "byte" is 8 "bits". Multiply it out, and you find that a "DWord" is indeed 32-bits.
Below is an example using the item type "Ring". Note that all letters are converted to lower-case as part of the hashing algorithm. This little detail will become of great importance when we get to the hashes in the Actor file.
| hash value | char value | calculation |
|---|---|---|
| 00000000 | r = 0x72 | hash = (00000000 * 0x21) + 0x72 |
| 00000072 | i = 0x69 | hash = (00000072 * 0x21) + 0x69 |
| 00000F1B | n = 0x6E | hash = (00000F1B * 0x21) + 0x6E |
| 0001F2E9 | g = 0x67 | hash = (0001F2E9 * 0x21) + 0x67 |
| 00405070 | hash |
So why use string hashing at all? The short answer is because dealing with hashes (which are 32-bit unsigned integers) is much faster than dealing with strings. When dealing with large data sets, hashes provide a very efficient way of accessing data (assuming a good hash function). The Wikipedia Hash function page provides additional details about the application of hashes and their performance characteristics.
Please note that the following sections are not necessary for a good understanding of Diablo 3 string hashing. If you're not a math geek or don't enjoy reading about the magical discovery process, feel free to skip straight to The Epiphany! Be warned that if you do, you will miss a fascinating tale about struggle and survival in the information (overload) age.
One-Way Hashing
A hash functions is a one-way function, that is, a function that is easy to compute on every input, but hard to invert given the image of a random input. This concept forms the basis of most cryptographic techniques. However, Blizzard did not implement string hashes for the purpose of obfuscation, but rather for performance. As a result, the hashing function that they chose to use a simple and effective.
That being said, for long enough strings, it is extremely difficult to determine the original string key given only its hash value. One of the reasons for this is that as you hash, you "lose" some information. In fact, it is this "compression" of information (from a long string into a much smaller 32-bit integer) that allows a hash value to be an effective way of indexing and correlating information.
Once a string reaches a length of 7 or greater, its hash value will begin to "lose" information. Let us illustrate this with a practical example, the hash for the item type "Crossbow".
| hash value | char value | calculation |
|---|---|---|
| 00000000 | c = 0x63 | hash = (00000000 * 0x21) + 0x63 |
| 00000063 | r = 0x72 | hash = (00000063 * 0x21) + 0x72 |
| 00000D35 | o = 0x6F | hash = (00000D35 * 0x21) + 0x6F |
| 0001B444 | s = 0x73 | hash = (0001B444 * 0x21) + 0x73 |
| 00383D37 | s = 0x73 | hash = (00383D37 * 0x21) + 0x73 |
| 073FE48A | b = 0x62 | hash = (073FE48A * 0x21) + 0x62 |
| EF3C762C | o = 0x6F | hash = (EF3C762C * 0x21) + 0x6F |
| D6CB3C1B | w = 0x77 | hash = (D6CB3C1B * 0x21) + 0x77 |
| B032BFF2 | hash |
The last two multiplications (when the 'o' and 'w' characters are processed) cause some high order bits to be dropped. Let us examine these in close detail using 64-bit wide arithmetic. The result in the left-most column represents the 64-bit value of the hash function. However, since the hash is stored as a 32-bit value, any bit beyond the 32nd bit (the higher order bits) are dropped. These bits are shown in red below.
| hash value | char value | calculation |
|---|---|---|
| 00000000 073FE48A | b = 0x62 | hash = (073FE48A * 0x21) + 0x62 |
| 00000000 EF3C762C | o = 0x6F | hash = (EF3C762C * 0x21) + 0x6F |
| 0000001E D6CB3C1B | w = 0x77 | hash = (D6CB3C1B * 0x21) + 0x77 |
| 0000001B B032BFF2 | hash |
As the hash function iterates over the characters in the key string, it drops all high order bits that do not fit in its 32-bit value hash. In the example above, when the character 'o' is processed, the resulting hash (with 64-bit precision) is 0x1ED6CB3C1B, but the leading "1E" is truncated and lost. When the 'w' character is then added, the hash value that is multiplied is the 32-bit (truncated) value.
In this simple example, we lost 2 bytes worth (or more precisely, about 12 bits) of data.
Reverse Hashing
The process of reverse hashing involves starting with a hash and applying a string in reverse order on that hash. In other words, the reverse hashing function is the inverse of the hashing function. If you read the previously section carefully, you are probably wondering how this is possible given that the hashing function loses information as it hashes.
Before we get into the details, let us first address the important question of why do we even care? Curiosity? Perhaps if you are math geek with a lot of spare time and a passion for investigating problems that don't really need solving. As it turns out, there are some practical reasons too.
Several folks have been trying to find the link between items and their 2D images (as they show up in your inventory or character page). We know that the 2D images of items are stored in Texture (*.tex) files. But there appears to be no correlationbetween the Items*.gam files and the Texture file itself. However, the records in the Items*.gam files each reference an Actor (*.acr) file. Currently, many items reference the same Actor file which is why you see many items sharing the same artwork. I am sure that this will not be the case when the game is live, but I digress.
So is there a relationship between the Actor file and the Texture file? As it turns out, there is. Starting at (absolute) file offset 0x320 in the Actor file, there 10 DWords that contain hashes. These 10 DWords correspond to the 10 class/gender combinations. In order, they correspond to the following pairs: DH, Barb, Wiz, WD, Monk, with the first DWord in each pair being for the male, and the second (if present) for the female. Items such as rings and weapons have the same values for all 5 classes.
As it turns out, these (up to) 10 hashes in the Actor files don't show up anywhere else in any of the client game files. Yet I was convinced that there was a link between these values (which I assumed were some kind of hash) and the images in the Texture files.
Initially, I tried hashing the image names (with and without some of the internal information from the Texture files), but no luck. That is, except for the "Voodoo Mask" (Witch Doctor) images. Surely this couldn't just be a coincidence. Statistically speaking, you are much more likely to win the PowerBall twice in a row than you are to have 4 random strings match 4 given hash values.
So now that we know that the image file name was somehow part of the string that yielded the hash, where do we go from here? The answer was reverse hashing. If we could figure out a way to start with a given hash (from the Actor) file, and work backwards by reverse hashing the image name, we might crack this difficult nut. And thus began the journey into reverse hashing...
Reverse Hashing Algorithm
I knew my string did not hash into the value from the Actor file. That meant that the actual string key had some other information in its key. One possible explanation was that it had some kind of string prefix before the image name. If we assumed that this string prefix hashed into a "prefix hash", then applying our image name string to this prefix hash would yield the hash value from the Actor file.
At first, I used a brute force approach the reverse hashing. That is, I itererated over every single possible prefix hash value (i.e., every value from 0x0 to 0xFFFFFFFF) to see if any starting prefix hash would give the Actor hash value when applying the image name. It turns out, many items (but not all) had the same prefix hashes. This was a very promising discovery. But there was one major problem. My brute force approach was very, very slow. It was taking several hours just to obtain the prefix of a couple of items, and I had a list of almost 800 images.
During this time, I was in regular contact with chippydip who, I quickly discovered, is no slouch at math. He realized that the code could be optimized because instead of running the hash algorithm every time, you could pre-calculate how much a +1 difference in the prefix hash would change the final hash and then just add that amount every time. Let us look at an example to see why this last statement is true.
Let us assume that 'p' is our prefix hash and 's' is the suffix string we are attempting to reverse hash. For simplicity's sake, let us look at the different for 'p' and 'p+1' when 's' is of size 1 (with character c0).
d3_hash(p,s) = ( p * 0x21 ) + c0
d3_hash(p+1,s) = ( ( p + 1 ) * 0x21 ) + c0
= ( p * 0x21 ) + 0x21 + c0
d3_hash(p+1,s) - d3_hash(p,s) = 0x21
Now let use examine the hashes when 's' is size 2 (with characters c0 and c1).
d3_hash(p,s) = ( ( p * 0x21) + c0 ) * 0x21 + c1
= ( p * 0x21^2 ) + ( 0x21 * c0 ) + c1
d3_hash(p+1,s) = ( ( ( p + 1 ) * 0x21) + c0 ) * 0x21 + c1
= ( p * 0x21^2 ) + ( 0x21^2 ) + ( 0x21 * c0 ) + c1
d3_hash(p+1,s) - d3_hash(p,s) = 0x21^2
If you extend this formula, you will see that the different between the hashes of 'p+1' and 'p' for a given string 's' is equal to 0x21 to the power of 'x', where 'x' is the length of the string 's' being hashed. This approach greatly reduces the number of mathematical operations required to determine the reverse hash.
But that was just the beginning! It turns out that chippydip would go one step further and discover the following. Note that hexadecimal 0x21 is equal to decimal 33.
We want to undo the "hash * 33 + c" operation used in the forward hashing function. We know the result and we guess the value of "c", so essentially we just need to compute "(newHash - c) / 33". Obviously this won't quite work because what we really want is "(overflow + newHash - c) / 33", but it turns out that we can figure out what "overflow" was based on the remainder of "(newHash - c) / 33".
"overflow" can be anywhere from 0 to 33 multiples of 2^32. Since we know that "overflow + newHash - c" must be evenly divisable by 33, and that 2^32 and 33 are relatively prime, each value of overflow will result in a unique remainder.
(Note that 0 * 2^32 and 33 * 2^32 are the exception and will both have a remainder of 0, but we can ignore this since "(33 * 2^32 + newHash - c) / 33 = 2^32 + (newHash - c) / 33" so the result is the same either way.)
With that explanation out of the way, our goal is now simply to find which overflow multiple from 0-32 will make our remainder zero. I'm sure there are multiple ways to do this (including a brute-force search of all 33 possibilities) but what I'm doing is taking the fact that "2^32 = 4 mod 33" and looking for how many multiple of 4 I need to add to my original remainder to get it to be a multiple of 33. I do this by trying to hit the next highest multiple of 33 and, if that doesn't work, realize that every 8 multiples of 2^32 I add will reduce the remainder mod 33 by 1 (4 * 8 = -1 mod 33). Once we have the overflow value, its a simple matter of adding "overflow / 33" to our original result value and we're done.
The key insight in the above is the fact that 2^32 and 33 are relatively prime. The following (pseudo-code) algorithm allows for the computation of the reverse hash:
function DWord d3_reverse_hash (DWord hash, string suffix) {
foreach (var c in suffix.ToLowerInvariant().Reverse()) {
var prevHashValue = hash;
// oldHash * 33 + c = newHash mod 2^32
// oldHash = (newHash - c) / 33 mod 2^32
var remainder = (hash - c) % 33;
hash = (hash - c) / 33;
// We want remainder = 0 mod 33 so find how many multiples of 2^32 we need to add
if (remainder != 0) {
// Calculate the number of multiples of 2^32 to add
var n = (33 - remainder) / 4; // see if we can hit the next multiple of 33
var offset = remainder + n * 4; // if not, see how far we overshot
n += offset * 8; // every 8 * 2^32 decreases our remainder by 1 (33 - 8 * 4)
// Update the remainder
remainder += 4 * n;
Contract.Assert(remainder % 33 == 0);
// Add the proper multiple of 2^32 (plus the combined remainders)
hash += n * 130150524u; // 130150524u = 2^32 / 33
hash += remainder / 33;
}
Contract.Assert(hash * 33 + c == prevHashValue);
}
return hash;
}
Actor File Hashes
Applying the reverse hashing function to the hashes (discussed above) from the Actor (*.acr) files, we notice the several different item image names from the Texture (*.tex) files have the same reverse hash:
| Actor Hash | Texture Image Name | Reverse Hash |
|---|---|---|
| 6722DE36 | Shield_norm_base_01_icon | 3E0F83E0 |
| 697806D7 | Shield_norm_base_02_icon | 3E0F83E0 |
| ... | 3E0F83E0 | |
| 5EF78F84 | Sword_norm_base_01 | 3E0F83E0 |
| 5EF78F85 | Sword_norm_base_02 | 3E0F83E0 |
| ... | 3E0F83E0 | |
| 3FA60847 | Wand_norm_base_01_icon | 3E0F83E0 |
| 41FB30E8 | Wand_norm_base_02_icon | 3E0F83E0 |
For items that have different icons between the classes, the DH and WD have the same reverse hash, while the Barb, Monk, and Wizard share the same reverse hash.
| Actor Hash | Texture Image Name | Reverse Hash |
|---|---|---|
| 838AB107 | Belt_norm_base_01_DH | DCA81FA0 |
| 838AB376 | Belt_norm_base_01_WD | DCA81FA0 |
| 90FA6F72 | Belt_norm_base_01_Barb | AAD0CFC0 |
| 9100B2B0 | Belt_norm_base_01_Monk | AAD0CFC0 |
| EAE59B0C | Belt_norm_base_01_Wizard | AAD0CFC0 |
Another observation is that for a given class, items from normal and hell difficulty have the same prefix hash, while items from nightmare difficulty have a different prefix hash.
| Actor Hash | Texture Image Name | Reverse Hash |
|---|---|---|
| 838AB107 | Belt_norm_base_01_DH | DCA81FA0 |
| 838B3D68 | Belt_norm_base_02_DH | DCA81FA0 |
| EDE4A92A | Belt_nightmare_base_01_DH | 4427C7A0 |
| EDE5358B | Belt_nightmare_base_02_DH | 4427C7A0 |
| 522E47D0 | Belt_hell_base_01_DH | DCA81FA0 |
| 522ED431 | Belt_hell_base_02_DH | DCA81FA0 |
So what's the common link between the prefix strings? I don't know... yet. With the help of reverse hashing, I hope to be able to figure out the full strings that makes up the Actor file hashes. It's clear that there is a link between these hashes and the image names. It's just a matter of time to solve the prefix string.
What Next?
After failing to find a clear pattern to the Actor hash prefix strings, I have turned to some more brute force techniques to find the missing link. When all you have is a hammer, maybe you need to find an even bigger hammer.
One approach I have taken is to try to find some commonality in the prefix between two similar items with different prefix hashes, for example, "Belt_norm_base_01_DH" and "Belt_norm_base_01_Barb". I do this by generating strings of a certain length and reverse hashing them from the prefix hash. I then check to see if this new hash is equal to the other prefix hash. If I can find a match, I will be able to (slowly) work backwards to determine the full prefix string.
Unfortunately, I have not yet found a match using strings up to length 5. And the longer the string, the longer it takes to reverse hash all of the possible values. The possible characters used in the hash string are: the 26 alphabet characters (a-z), the 10 digits (0-9), and a handful of special characters (e.g., the underscore). I am using a character set of 39 possible characters. As such, a string of length 5 has 20,511,149 possible permutations. When you move up to length 6, the number of permutations jumps to 799,934,811.
To help with these reverse hash computations, I have decided to pursue an OpenCL implementation. "OpenCL (Open Computing Language) is an open royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms." Basically, I plan to turn my Radeon HD 5850 video card into a highly optimized reverse hashing machine.
I am developing this code using the MinGW C/C++ compilers under Windows. No, I am not a Visual Studio developer, and have (thankfully) managed to program for the last two decades (I'm not that old, I just started real young) without ever having to write Microsoft native code. Feel free to contact me directly if you are interested in getting up and running with OpenCL on Windows with MinGW.
The Epiphany
Turns out the answer to the Actor hash dilemma was so obvious, that I only found it by accident.
I was in the process of writing OpenCL code to run reverse hashing for larger and larger strings on my high end graphics card. In the process, I was re-writing a lot of my hasing code from PHP into C. Yes, I should never been using PHP for this kind of stuff, but the rest of my D3 code was all PHP so I suffered through it which is no small miracle considering PHP doesn't natively support unsigned ints (so I had to write my own uint32 conversion functions and deal with hex strings).
In any case, while porting the hashing code to C, I ran some tests on the normal hashing function. I had cut and paste one of strings ("Belt_norm_base_01_DH") that I had been working on while trying to reverse hash the prefix. Sure enough, the hash I got was exactly what I thought I was expecting; or rather, it was what I thought I was expecting. And then I realized that the hash value I was looking at was actually the same hash as the one from the Actor file!
And that's how I realized that the Actor hashes are case sensitive hashes! Off course, this perfectly explains why the "voodoomask_norm_base_01_icon" items reverse hashed to 0; they were the only image string names that were all lowercase! It also explains why we were getting common prefixes: as we demonstrated earlier, d3_hash(p+1,s) - d3_hash(p,s) = 0x21^x where x = strlen(s). And of course, the ordinal difference between a lowercase and uppercase character is fixed (0x20) hence items that had the same number of uppercase characters in the same location ended up with the same prefix hashes. This also explains why the _DH and _WD (two uppercase characters) classes had one offset, but _Barb, _Wiz, and _Monk (a single uppercase character) had another.
So in the end, I spent a lot of time on a wild goose chase, but what an interesting chase it turned out to be. The reason I was led down this path was because I had made the (unconscious) assumption that all hashes converted the string keys to lowercase before hashing. After all, that's how it worked for all of the other files that I had been working on. It had worked perfectly until the hashes from the Actor files.
If there is a moral to this story, it's probably never to underestimate Occam's Razor.
Special thanks to chippydip from Diablo 3 Lexicon for his discovery of the hashing algorithm and for his tremendous optimizations of reverse hashing.
