Searching, Sorting, and the Memory-Time Tradeoff
Written 2008-06-08
Tags:Data structure Sorting algorithm Programming image processing Data type
Case A: Suppose you needed to search through a black and white ( real B&W, not greyscale ) picture and pull out a list of white objects from the black background?
Case B: Suppose you needed to sort 50 Gigabytes of unsigned shorts?
Most people will tell you that the limit for sorting is n*log(n). But that's not really the whole story. The hard limit to any sorting algorithm is actually linear time. I say this, because in Case B, you could sort them rather easily if you knew where to put each item, irrelevant of the others. Suppose there are ten numbers. I look at the first number, and if I instinctively know where to put it, I can put it where it needs to go, and continue sorting the array in the same way. Many, many problems do not lend themselves to this type of solution, but Case B, and in a way, Case A both do.
Solving Case B: There's 50GB of unsigned shorts. An unsigned short ranges from 0 to 65535. Great for postal codes, not so great for much of anything else. An unsigned short also happens to take up 2 bytes, so there are 25,000,000,000 unsigned shorts to sort. Assuming they are all the same (worst case), we'll need a data type that can hold all of them...how about an unsigned long long(ULL)? A ULL can hold 18446744073709551616 distinct values, and there happens to be a prebuilt structure for them in gcc. Now that we've established some definitions...on to the game plan.
So we read in the first unsigned short. For our purposes, a 1337. By it's very nature, we can make some inferences about where to put it. Assuming there's an even distribution, we can assume it'll end up somewhere on the left of the mean...or better yet, we can use its absolute position.
I've posted example code and runtimes here. If you look at the file 'run' you'll see the number of unsigned shorts sorted, and the second column is the seconds required to sort them. Keep in mind that each unsigned short also needs a call to rand(), which takes up the majority of processing. Also keep in mind that this example doesn't reassemble the data, which is trivial, but also builds into the time complexity MAX(linear term, constant term).
Anyways...how does this help solve Case A? The hard limit of A is linear against width, and linear against height. Assuming a fixed ratio (in this case 4:3), the hard limit is quadratic against either.
The first algorithm I thought of was pretty simple, and horrendously slow. Lets say each pixel will belong to a "blob" and each "blob" can have multiple pixels. In my case, I used an STL vector.
The first addition is to make the algorithm non-recursive. Ryan Meuth used this in his approach and solved the 'U' problem (finding two parts of a blob before finding the center) by the addition of a lookup table.
The second fix is to store each blob as a 2-dimensional array of pixels. In fact, if you can spare the memory, allocate an entire image-worth. To check if a pixel is adjacent to a blob, you can take its position in the array, and look around it to see if there are other set pixels. In this way, a lookup on a blob now has a constant time of eight lookups. Inserting a pixel into the blobspace will still be linear over the number of blobs, but as far as I can tell, that can't be helped. We still have one small problem: each blob has its pixels stored in an obnoxiously hard-to-use format, being a picture. The solution is simple: create a hybrid structure. Each blob has two containers: a matrix and a queue. Insert a pixel into the blob, and you insert it into both. To obtain a list of all pixels in a blob, simply process the queue.
Case B: Suppose you needed to sort 50 Gigabytes of unsigned shorts?
Most people will tell you that the limit for sorting is n*log(n). But that's not really the whole story. The hard limit to any sorting algorithm is actually linear time. I say this, because in Case B, you could sort them rather easily if you knew where to put each item, irrelevant of the others. Suppose there are ten numbers. I look at the first number, and if I instinctively know where to put it, I can put it where it needs to go, and continue sorting the array in the same way. Many, many problems do not lend themselves to this type of solution, but Case B, and in a way, Case A both do.
Solving Case B: There's 50GB of unsigned shorts. An unsigned short ranges from 0 to 65535. Great for postal codes, not so great for much of anything else. An unsigned short also happens to take up 2 bytes, so there are 25,000,000,000 unsigned shorts to sort. Assuming they are all the same (worst case), we'll need a data type that can hold all of them...how about an unsigned long long(ULL)? A ULL can hold 18446744073709551616 distinct values, and there happens to be a prebuilt structure for them in gcc. Now that we've established some definitions...on to the game plan.
So we read in the first unsigned short. For our purposes, a 1337. By it's very nature, we can make some inferences about where to put it. Assuming there's an even distribution, we can assume it'll end up somewhere on the left of the mean...or better yet, we can use its absolute position.
We're going to need an array of counters for this next trick, something like "unsigned long long counters[65536];" And we'll just increment the 1337 counter by one. For the next 24,999,999,999 unsigned shorts, we'll do exactly the same thing. Now we have an array containing counts from one to (at worst) 25,000,000,000. All we have to do is traverse the array, reassembling the sorted structure.So how long would it take to sort 50GB of files holding Unsigned Shorts? About as long as it takes to read the unsorted files and write the sorted files. I discovered this algorithm last week on a robotics competition, while trying to solve Case A. (it's actually been done before-right here is a great paper on it-I wasn't the first. I was a little disappointed.) But anyhow, it has some great features:
- Time complexity is linear for large collections of objects, or constant for small collections.
- It is easily extended to sort complex data structures, although there are some rules.
- It's a stable sort for larger objects, if implemented properly. (Use an array of queues instead of the counters)
- It requires an onto mapping for all objects to be sorted
- It works great on standard memory, but performance degrades for files or NUMA machines, because the underlying data structure must contain all objects in ram.
- The characteristic to sort on must be limited by a maximum value, and that maximum value should be small in order to prevent the constant time component from growing too large.
- Only works well for large collections of objects. For most collections of objects smaller than the range of the characteristic to sort upon, you're probably better off with introsort or one of it's ancestors.
I've posted example code and runtimes here. If you look at the file 'run' you'll see the number of unsigned shorts sorted, and the second column is the seconds required to sort them. Keep in mind that each unsigned short also needs a call to rand(), which takes up the majority of processing. Also keep in mind that this example doesn't reassemble the data, which is trivial, but also builds into the time complexity MAX(linear term, constant term).
Anyways...how does this help solve Case A? The hard limit of A is linear against width, and linear against height. Assuming a fixed ratio (in this case 4:3), the hard limit is quadratic against either.
The first algorithm I thought of was pretty simple, and horrendously slow. Lets say each pixel will belong to a "blob" and each "blob" can have multiple pixels. In my case, I used an STL vector.
So to start, search through the picture until a white pixel was found. Set the pixel to black, check if it can be stored in a known blob (if a blob is adjacent to it), otherwise create a new blob, and recursively scan the 8 pixels around this found pixel.This algorithm has two failing points. The first, is that the recursive factor will scan each pixel more than once. This isn't optimal, especially if we're looking to make this go fast. Secondly, since my blobs are made of vectors, not only must we search through each blob, we must search through every single pixel. Thankfully, I added a little hack, some range checking. Each blob knew the ranges it spanned(MaxX, MaxY, MinX, MinY), and so often checking an entire blob for adjacency with a pixel would only take a few operations, but other times it took an entire linear search. I could have sorted each blob internally by some characteristic, say X or Y, but this is just building on a broken machine, like ripplesort is to bubblesort.
The first addition is to make the algorithm non-recursive. Ryan Meuth used this in his approach and solved the 'U' problem (finding two parts of a blob before finding the center) by the addition of a lookup table.
So, in much the same way, search through the image, creating new blobs if needed and appending to existing blobs. But, if you ever find a pixel that is adjacent to two blobs, mark it in the table and combine them later.This is a little slower, because each pixel has to be compared against every single blob, regardless of a previous match, but it is much faster than my recursive solution above. What may perform even faster, is combining blobs on the spot.
The second fix is to store each blob as a 2-dimensional array of pixels. In fact, if you can spare the memory, allocate an entire image-worth. To check if a pixel is adjacent to a blob, you can take its position in the array, and look around it to see if there are other set pixels. In this way, a lookup on a blob now has a constant time of eight lookups. Inserting a pixel into the blobspace will still be linear over the number of blobs, but as far as I can tell, that can't be helped. We still have one small problem: each blob has its pixels stored in an obnoxiously hard-to-use format, being a picture. The solution is simple: create a hybrid structure. Each blob has two containers: a matrix and a queue. Insert a pixel into the blob, and you insert it into both. To obtain a list of all pixels in a blob, simply process the queue.