Hashing


Guest X-Fusion

Most scripters already know that using hash tables is faster than using other methods of storing information, like using .ini files. A lot of people, though, don't know why exactly hash tables are faster.

 

Part of their speed comes from their residence in memory, while .inis--unless cached--reside on the hard disk. Operations that read from or write to disk are slower than operations that deal with memory.

 

But that's not all of it. If you had a cached .ini file and pitted it against a hash table that contained the same information (using item names that combine section and item in .inis), hash tables would still be faster on average.

 

Why is this? It comes from the way a hash table is set up; it's a data structure, and not simply a file that gets parsed every time you need to access it.

 

Linear Search vs. Hashing

If you had to write an alias that does what /writeini and $readini do, you'd probably do it the same way I would: /writeini would search through every line of the file, and overwrite the appropriate line when it found it. At the end, if it didn't find the line, it'd add a new one under the right section. $readini would scan through every line, and return the right line under the right section. If it didn't find anything to return, it'd stop and return $null.
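The linear scan described above can be sketched in Python (standing in for mIRC script, which can't be run here; the function name and the sample data are illustrative):

```python
def read_ini(lines, section, item):
    """Linear scan, like $readini: walk every line until the item
    is found under the right [section], or return None ($null)."""
    in_section = False
    for line in lines:
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            # Entering a new section header
            in_section = (line[1:-1] == section)
        elif in_section and "=" in line:
            key, value = line.split("=", 1)
            if key == item:
                return value
    return None  # scanned the whole file and found nothing

ini = ["[phones]", "jenny=8675309", "[misc]", "foo=bar"]
print(read_ini(ini, "phones", "jenny"))  # 8675309
```

Notice that a miss is the worst case: every single line gets inspected before `None` comes back.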

 

This is an example of linear searching, so called because the amount of time it takes to execute is based (roughly) on the number of lines in the file. If you graphed execution time against file size, you'd get a line.

 

Now, this isn't a problem if you have small files or a lot of time. But as files get bigger, writing and reading take longer, especially if the section is at the end of the file.

 

So, if hash tables don't use linear search, what do they use?

 

They use hashing. [Note: this has nothing to do with $hash().]

 

What is hashing?

The best way to describe hashing is to imagine a bunch of buckets. Say that we have five buckets, numbered 0 to 4, each filled with index cards. On the front, the index cards have a phone number; on the back, they have a name.

 

Now say we want to find out who has the number "8675309". We could either do a linear search, looking through each bucket until we find the right card, or we could cheat.

 

Before we set up these buckets for holding phone numbers, we establish a system that tells us where to put certain phone numbers. We create a hash function that gives us a value based on the phone number. We then divide that number by 5 (the number of buckets) and take the remainder. This way, we know what bucket to look into, based on the number that we're looking for.

 

But wait a second. Why do we take the remainder after dividing by 5 (in other words, compute f(x) % 5, where f(x) is our hash function), instead of having the function pick the bucket itself? Keeping the two steps separate makes the hash function easier to reuse later: it returns a plain value, not a bucket number. That way, if we wanted to add a sixth bucket, we wouldn't have to change the function, just the divisor.
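The two-step scheme above, a hash function plus a separate modulo, looks like this in Python (the digit-sum hash here is just one illustrative choice, not the one the article settles on later):

```python
def hash_fn(phone):
    """Illustrative hash function: the sum of the digits."""
    return sum(int(d) for d in phone)

NUM_BUCKETS = 5

def bucket_for(phone):
    # The hash function returns a plain value; the caller picks
    # the bucket with a modulo. Changing NUM_BUCKETS (adding a
    # sixth bucket) never requires touching hash_fn itself.
    return hash_fn(phone) % NUM_BUCKETS

print(bucket_for("8675309"))  # 8+6+7+5+3+0+9 = 38, and 38 % 5 = 3
```

So "8675309" always lands in, and is always looked up from, bucket 3.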

 

So now, before we add an index card to our buckets, we have to figure out which bucket it goes into. And when we want to take an index card out, we have to do the same thing.

 

But it's easier to search through one bucket than five. Imagine if "8675309" were in the fifth bucket, at the bottom: we'd have wasted all that time, when we could have just set up this hashing system first.

 

Hash Functions

Now, if we want to implement this phone-numbers-in-a-bucket system, we really ought to figure out what hash function we want to use. We could add the first and last digits of the phone number, or we could add the middle three, or we could just return the middle digit. These are decent hash functions.
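The three "decent" candidates just mentioned, written out in Python for a 7-digit phone number string (the function names are mine, for illustration):

```python
def first_plus_last(phone):
    """Add the first and last digits."""
    return int(phone[0]) + int(phone[-1])

def sum_middle_three(phone):
    """Add the middle three digits."""
    mid = len(phone) // 2
    return sum(int(d) for d in phone[mid - 1:mid + 2])

def middle_digit(phone):
    """Just return the middle digit."""
    return int(phone[len(phone) // 2])

phone = "8675309"
print(first_plus_last(phone))   # 8 + 9 = 17
print(sum_middle_three(phone))  # 7 + 5 + 3 = 15
print(middle_digit(phone))      # 5
```

All three are cheap to compute and, crucially, always give the same answer for the same number.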

 

A bad idea is returning the number's length (assuming phone numbers in your home country all have the same length--I realize some places, like the United Kingdom, vary the length by a region's population, which is more logical than the United States' scheme, but that's beside the point of this tutorial). A worse idea is returning a value based on the time, or the date, or something else like that.

 

Why? Because we want our function to return the same value for the same input, every time. If your function returns $ctime, we will search in the wrong bucket most of the time.

 

Pulling It All Together

Let's assume that we want to script a "replacement" for hash tables in mIRC.

 

We establish our hash function to be the sum of the first and last numbers in the phone number.

 

How do we set up our buckets? As messy as it would be, I would suggest hidden @windows. The way I've seen this done in "real" languages is with an array of queues, but hidden @windows are the simplest way to implement hash tables in mIRC.

 

So how does our replacement work?

 

On /hmake <name> <N>, we create N buckets. (I bet some of you have wondered why exactly we have to specify a number of "slots." mIRC allows you to vary the number of buckets your table contains.)

 

On /hadd <name> <item> <value>, we compute the bucket number, and /aline the item and value to that bucket's @window. On /hdel <name> <item>, we find the right bucket, search through the lines in the @window, and delete the right line.

 

On $hget(<name>, <item>), we find the bucket number, find the right line, and return it.

 

The other functions aren't completely necessary, but I'm sure you could figure out a way to implement them as well.
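Pulling the /hmake, /hadd, /hdel, and $hget steps together, here is a minimal sketch in Python (Python lists stand in for the hidden @windows, and the hash function is the article's first-digit-plus-last-digit choice; the class and method names are mine):

```python
class BucketTable:
    """A bucketed table like the @window-based replacement
    described above."""

    def __init__(self, n):
        # /hmake <name> <N>: create N buckets
        self.buckets = [[] for _ in range(n)]

    def _bucket(self, item):
        # Hash function from the article: first digit + last digit,
        # then modulo the number of buckets.
        return (int(item[0]) + int(item[-1])) % len(self.buckets)

    def add(self, item, value):
        # /hadd: compute the bucket, overwrite if present, else append
        bucket = self.buckets[self._bucket(item)]
        for pair in bucket:
            if pair[0] == item:
                pair[1] = value
                return
        bucket.append([item, value])

    def delete(self, item):
        # /hdel: find the right bucket and drop the matching line
        i = self._bucket(item)
        self.buckets[i] = [p for p in self.buckets[i] if p[0] != item]

    def get(self, item):
        # $hget: search only one bucket; None stands in for $null
        for key, value in self.buckets[self._bucket(item)]:
            if key == item:
                return value
        return None

table = BucketTable(5)
table.add("8675309", "Jenny")
print(table.get("8675309"))  # Jenny
```

Every operation scans at most one bucket, which is the whole point of the cheat.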

 

Conclusion

Hopefully, after reading this you've come to see how cheating--knowing where to look for data--is a lot faster than searching through an entire file (or another linear data structure). Even though the final search within a bucket is linear, searching through ten items out of one hundred is faster than searching through all one hundred.

**Edit .. Made by Nukem**

Hit the post reply button before credits.

Edited by X-Fusion

...Operations that deal with reading from or writing to disk are faster than operations that deal with memory...

 

I think you meant this the other way around. Operations that deal with disk are slower than operations that deal with memory.

 


  • 1 year later...

No.

 

/hmake <name> <N>

 

A hash table can store an unlimited number of items regardless of the N you choose; however, the bigger N is, the faster it will work, depending on the number of items stored.
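The point about bigger N can be made concrete with a quick Python sketch: spread 1000 items over N buckets and look at the average number of items a lookup has to scan (the digit-sum hash and the seeded random data are just for the demonstration):

```python
import random

random.seed(1)  # deterministic demo data
phones = [f"{random.randrange(10**7):07d}" for _ in range(1000)]

def avg_chain_length(n):
    """Distribute the 1000 phone numbers over n buckets and
    return the average number of items per bucket."""
    buckets = [0] * n
    for p in phones:
        buckets[sum(map(int, p)) % n] += 1
    return sum(buckets) / n

for n in (5, 50, 500):
    print(n, avg_chain_length(n))  # 200.0, then 20.0, then 2.0
```

Same items either way; more buckets just means each bucket's linear scan is shorter, at the cost of more memory for the bucket structure.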

 

 


Yeah, the bigger N is, the faster it could work, and the more memory it could take.

 

Ultimately, you'll find that depending on the particular job at hand, different processes will have different effects. And mIRC has so many ways of processing information that at times, the things you think are slowest may be just as effective, and less memory-consuming.

 

You must remember that resources are a limitation: you cannot just think about speed as an independent variable. It is dependent on the capability of the computer--how much it can store, both in its memory and on its drives. This is something you especially need to consider when dealing with public release scripts, where you'll get people with all different computers using the same process.

 

If you get into object-oriented programming, and perhaps even data structures, you'll understand that different data structures yield different methods of searching, and these different methods should return the same results, but at varying speeds. I'm not sure what method mIRC uses, but if you google things like "binary search tree", "heaps", "treaps", you'll begin to get an idea of the actual theory behind it all--rather than just quickly saying one method is better than another without knowing the actual processes behind it.


  • 9 months later...
