If you are a programmer, you must have heard the term “hash function”. In its most general form, a hash function projects a value from a set with many members to a value from a set with a fixed number of members. We basically convert the input into a different form by applying a transformation function. Hash functions map data of arbitrary length to data of a fixed length. The values returned by a hash function are called hash values. An interesting thing to note is that hash functions are not reversible. This means that you cannot recover the original data from hashed values. This property makes it one of the most useful data structures known to mankind. Hash functions are used extensively in internet security. In this post, we will talk about C++ STL and how to use hash functions with user defined classes.
Why do we care about this?
Almost every single data structure is made available as part of C++ STL. The containers that implement hash tables are std::unordered_set and std::unordered_map. These containers usually work with predefined datatypes like int, float, string, etc. So if you want to use them with a user-defined classes, you need to define the following two things:
- Hash function: It is basically a mathematical operation that defines how we transform the input. We need to specify the rule so that the compiler knows what to do. This must be a class that overrides operator() and calculates the hash value given an object of the key-type. The inbuilt hash function expects a predefined data type to be the input, so that it can hash the value. In our case, we have a custom class. So the compiler won’t know what to do. So we need to specialize the std::hash template for the user-defined class. We need to come up with a hash function that depends on the data inside the user-defined class.
- Comparison function for equality: As we know, all hash tables need to know how to deal with collision efficiently. Wait a minute, what is collision exactly? A collision happens whenever two or more inputs are mapped to the same hash value. If we don’t take care of collision, then our input data might get overwritten. Now you might ask, how is it possible that multiple inputs might get mapped to the same value? Well, that depends on how good your hash function is. The hash cannot rely on the fact that the hash function will always provide a unique hash value for every distinct key, so it needs a way to compare two given keys for an exact match. Now if the input is int or float, it can just directly compare the values. But since we have a custom class, we need to tell it how to compare two instances. You can implement this either as a class that overrides operator(), or as a specialization of std::equal, or by simply overloading operator==() for your key type.
How to do it?
The difficulty with the hash function is that if your key type consists of several members. So now, we need to tell the hash function to calculate hash values for the individual members. We will then combine them into one hash value for the entire object. For good performance, i.e. fewer collisions, you should think carefully about how to combine the individual hash values to ensure you avoid getting the same output for different objects too often. Remember, we cannot recover original data from hashed values! So if we map multiple objects to the same output, it’s gone.
A fairly good starting point for a hash function is the one that uses bit shifting and bitwise XOR to combine the individual hash values. For example, let’s say we have a custom class like this:
class Line { private: float m; float c; public: Line() {m = 0; c = 0;} Line(float mInput, float cInput) {m = mInput; c = cInput;} float getM {return m;} float getC {return c;} void setM(float mInput) {m = mInput;} void setC(float cInput) {c = cInput;} bool operator==(const Line &anotherLine) const { return (m == anotherLine.m && c == anotherLine.c); } };
Now we override the functionality of the hash function with our custom definition:
namespace std { template <> struct hash<Line> { size_t operator()(const Line& k) const { // Compute individual hash values for two data members and combine them using XOR and bit shifting return ((hash<float>()(k.getM()) ^ (hash<float>()(k.getC()) << 1)) >> 1); } }; }
Now in the main function, we can use the hash function as given below:
int main() { unordered_set<Line> t; Line line1 = Line(1.0,2.0); Line line2 = Line(3.0,4.0); Line line3 = Line(3.0,4.0); t.insert(line1); t.insert(line2); t.insert(line3); for(unordered_set<Line>::const_iterator it = t.begin(); it != t.end(); it++) { Line t = *it; cout << t.m << " " << t.c << "\n" ; } return 1; }
It will automatically use std::hash<Line> as defined above for the hash value calculations, and the operator== defined as member function of Line for equality checks. If you see the print out on the command line, you will see that “line3” is not printed because it’s the same as “line2”.
————————————————————————————————-
The is not compiling !!!!
Can you paste the error message here?
hash2.cc: In member function ‘std::size_t std::hash::operator()(const Line&) const’:
hash2.cc:35:43: error: passing ‘const Line’ as ‘this’ argument of ‘float Line::getM()’ discards qualifiers [-fpermissive]
return ((hash()(k.getM()) ^ (hash()(k.getC()) <> 1);
^
hash2.cc:35:70: error: passing ‘const Line’ as ‘this’ argument of ‘float Line::getC()’ discards qualifiers [-fpermissive]
return ((hash()(k.getM()) ^ (hash()(k.getC()) <> 1);
^
hash2.cc: In function ‘int main()’:
hash2.cc:11:11: error: ‘float Line::m’ is private
float m;
^
hash2.cc:54:19: error: within this context
cout << t.m << " " << t.c << "\n" ;
^
hash2.cc:12:11: error: 'float Line::c' is private
float c;
^
hash2.cc:54:33: error: within this context
cout << t.m << " " << t.c << "\n" ;
why do you choose exactly this function:
hash({a,b}) = (hash(a)^(hash(b)<>1
?
fix misprint: hash({a,b}) = (hash(a)^(hash(b)<<1
This is just one way of implementing a hash function. I used it to demonstrate how to use a hash function. You can experiment with other functions too.
Where are you using the hash in your example?
We use hashing when we define the hash structure in the example. The hash computation is happening in the following line:
return ((hash()(k.getM()) ^ (hash()(k.getC()) <> 1);
Yes of course. My question is not about where are you hashing, but where in the main program the hash is used. 🙂 Thanks for your example by the way, it helped me a lot!
You can implement a hash table using unordered_set container, but it doesn’t work with user defined classes. To overcome this restriction, you override the default behavior by creating your hashing function. So now it gets activated whenever you declare a variable as shown in the following line:
unordered_set t;
Now if you try to insert an item that already exists, then it will use the hash table implementation to check if the item already exists. When you run t.insert(line3), it will check the table to see if it exists. Since line2 is the same as line3, it won’t insert this duplicate item. Hope this answers your question 🙂