0:02

My key index counter is a great algorithm but that's not the end of the story.

It's also useful for creating a more general purpose algorithm for strings.

The first one we'll look at is called LSD radix sort, Least Significant Digit for

string sorting. And the idea is a very simple one.

We have strings so we're gonna consider now a small example where the strings are

all the same length. And again that's often true in practical

applications account numbers and so forth. String that uses, sort these, may all be

the same length. And what we're gonna do is consider.

The character positions in the strings, and move from right to left.

The algorithm, is to just stably sort using key index counting, on, the chara-,

deed character of the key, where deed goes from the right end and decreases.

So, this is, a stable sort. Of those twelve keys, sorting on the

right-most character, the B's go before the D's go before the Es.

It's crucial that the sort be stable in this application, and that's why we

checked with key index counting to make sure that it was stable.

So it's a stable sort on, on the right most character.

And then all we do is move from right to left and do now the second character.

And this now is a stable sort of those same keys on the second character.

So now the ones with A in the second character come before the one with B and

so-forth. And not only that, it's stable, so their

relative order is maintained, BAB, CAB, FAD, BAD, and so-forth.

And then. To finish this.

Sort. Now we do it on the first key.

And the magic of LSD radix sorting is eh, once you do it on the first key then the

strings are all sorted. So, that's three passes, one for each

character in the string. Each taking linear time and we get a

string sort. That's LSD sort.

Now we need to prove that it works. And so this is a simple proof by induction

that it worked. After we have done I passes then we can

assume by induction that the strings are sorted on the last I characters.

So we are just showing that for two, after two passes. it sorted on the last two

characters thats we're assuming by induction.

So now what about the next character that we are sorting.

Well, there's two things that can happen. If two strings are different on the first

key, then, the key index sound is gonna do the job.

A string that starts with B is gonna come before one that starts with a D, and so

forth. So if they're different on the sort key,

the key index sort puts them in order. If they're the same on a sort key then

stability does the job. So all the ones with that stay on order

because we've insisted on a stable sort. That's a simple proof by induction that

LSD string sorts fixed length strings in a descending order.

And it's really easy to implement. This is a complete.

Java implementation of LSD string sort. So we're explicitly working with radix

r256. =

256 and where the radix comes in, the value of the radix comes in, is that's the

size of the array that we use for the, counts and the accumulants.

We need one for each character. For each possible character value, we're

gonna index into that array that has to be that big.

And all this is, is the code for key index counting.

And then all we do is take a variable t that goes down from this is the strings of

fixed width w And we start at the rightmost character and go down to the

first character. And instead of dealing with a-ah, our

string A of I2 we're just look at the deed character which is a character.

Otherwise just with that replacement and that replacement it's the same code as we

looked at for key index counting. So let's do key index counting on the deed

character going down from the width from right to left.

That's remarkably compact code. And that's gonna be the method of choice

for lots of situations with fixed length keys as the sort key.

And, and it gives us another look at the performance of sorting out algorithms.

That gives us another line in the table that we're requiring that, that be, they

be fixed length keys, there's ways to work around that.

And we'll consider another algorithm that deals with that in a minute.

But again it's often or typically the case that the.

Width of the keys is not that long. It's a small constant.

And therefore, we have a linear time algorithm.

This even works if the keys are binary numbers represented in a binary word.

We can break them up into groups of small number of bytes Say, 64 byte number can be

broken up into 88 eight bytes characters or four sixteen byte characters.

For sixteen byte characters, W would be four and you can get a huge array of that

kind of numbers sorted in just the four passes through the array.

If they don't have the same length we have to do some extra work.

It's an interesting problem to think about.

We're, we're gonna look at a different method in a minute.

So here's the type of question that somebody might could ask for a job

interview. Actually, a web services company, every

day, might be in the position of needing sort a million or a billion 32 bit or 64

byte integers. And an algorithm student, in interviewing,

might could ask what sorting method do you use?

Now, senator. You hear Google and I like to think of the

presidency as a job interview. Now it's hard to get a job. as president.

Right. And, and you're going through that obviously now.

It's also hard to get a job at Google. Right.

[laugh] We, we have questions. And we ask our candidates questions.

And this one is from Larry Schwimmer. Okay.

[laugh]. You guys think I am kidding?

It's right here. What is the most efficient way to sort a

million 32 bit integers? Well I, no, no, no, no, no, I, I think the

bubble sort would be the wrong way to go. Come on, who told him this?

Okay. I didn't see computer science in the

background. We've got our spies in there.

Well. 'Kay.

Let's ask, let's ask a different interview [laugh] question.

[laugh]. So the bottom line is if you want a good

job maybe, you wanna know about LSD string sort.

Actually, this method has been around for, really a long time.

So we'll start with a little bit of a story.

So what did people do, in the nineteenth century when, they wanted to take a

census? And actually, the story is that, for the

1880 census, it was actually obsolete, before it was completed.

It took 1,500 people seven years to manually process the data.

So, during that time there was room for some invention, and a man named Herman

Hollerith developed an automated machine that could help do the census faster.

So what his idea was to use punch cards to record data.

The kind of data that was taken in the census.

And then the machine could tabulate the data by sorting one column at, at time.

And we'll look a byte at how it does that in just a minute.

And the idea was that the re-, result of that was that the next census finish

really much earlier and under budget, 'cause this machine automated much of the

process. And that had a really a profound effect on

the development of computing 'cause punch cards, it turned out were useful not just

for census but for many other applications.

For accounting and for many other for business processes and for many decades,

punch cards were the primary medium that was used to store, enter, and process

data. And Hollerith's company for building this

machine later emerged with a bunch of other companies.

And in 1924 that company became known as IBM.

And actually, punch cards were used up into the 70's' And even in the 80's' in

some places. So, if, it's important.

Let's take a little break and talk about the role of LSD string sort.

For you know, a couple of decades people who wrote programs were working with

punched cards. And in courses at universities, if you

want to write a program, you wrote it by putting one line on each punched card.

In your program, therefore, was a deck, a long deck of punched cards.

If you had a 1000 line program, you had 1000 punched cards.

They came in boxes that held 2,000, and people would carry around these boxes of

punch cards that, that were their programs.

To enter the program there was a thing called a card punch which had a, which had

a keyboard kind of like a typewriter, but all it did, you could see the cards, and

it'd actually punche holes in the card with what you typed.

Now there was a huge so then, you, you're program was punched cards.

And there was a machine called a card reader which would take the cards in and

convert those punches back into, into binary and characters that again, could be

processed on the computer and then you get your results printed out on paper in a

line printer. So for many, many years people programmed

by making decks of punched cards handing them to an operator who put them on a card

reader and then waiting for the printed output to come out.

There were other devices, but this was the main thing for a long time.

And lots of people learned to program this way.

Now, there was a huge flaw in the system though.

The flaw was, if you dropped the deck, then your program was completely scrambled

and out of order. And you had no program.

You had to go through the cards and find the first line and, then find the second

line. Well, people figured out to work around

for this really almost right from the beginning, cause this is clearly

12:25

intolerable, situation, and, the, along with the same room with the card punch,

There was a thing called a card sorter, And the card punch did one other thing

automatically. every time you punched a card.

It would go to the last six columns of the card and it would put in, a number.

Actually they skip by ten So, it would be. The first card would be card ten then

twenty, 30. And your cards would be numbered up to six digits, so you could

have thousands of cards sequentially. So when you typed in your program, you get

the cards numbered, in order. If you wanted to add a few lines to a

program, you had room to add a couple of numbers and, re-punch the numbers, but,

nuh, the whole point was, that all the time when you're holding on to a card

deck, the cards are in order, by number on the right hand column.

And if you dropped it, all you needed to do was sort it.

Or if the m-, machine operator dropped it, that was not viewed as a big deal for your

cards to get [laugh] out of order, because there was this machine that could sort

cards. And the way that it worked was, LSD Rate

exort, the LSD string sort on those characters that are the numbers that keep

the card in order. They would, you'ld start on the right

column. And there was a physical thing, you'd set

to a call and, it was gonna sort on, you put the deck in, and it'd distribute, the

ones with zero in the first bin, or ones with one in the second bin, all the way

up, it'd, and of all the cards it start with zero would come in, and it was

stable, whatever order the cards were in, that's the order they'd appear in the

pile, and then you'd pick them up, and you'd have a new deck, and it'd be all

sorted on the right-most column. Then you move over for one position from

right to left, and run the cards through again.

So if running the cards through the card sorter six times, you could get your deck

sorted. So, every programmer knew LSD radix sort.

For decades, it was not something that was, difficult to teach, 40 years ago.

And, these, this equipment, is now all pretty much gone.

But LSD radix sort, is still a good algorithm to know. Not related is

something else that was going on with those initials at that time.