Interview question: Remove duplicate characters

Sakis Kasampallis | Oct 6, 2012
Let's assume that we are asked to solve the following problem in a programming interview: "Write a function that removes all duplicate characters from a string". Piece of cake, you might think. They are certainly not asking us to write a web browser, but there must be a reason for asking something this simple.

Test first

Interview time is limited, but that doesn't mean we should start writing whatever implementation comes to mind. We should first show the interviewer that we are familiar with techniques such as Test Driven Development (TDD). The test cases are useful even if the interviewer is not interested in seeing unit tests, so let's write them down. We should make sure that our function:
  • Doesn't fail with empty strings (so "" is valid input)
  • Returns a string of length = 1 as is (for example "a" is returned as "a")
  • Returns a string of length = 2 with two same characters as a string with a single character (for example "aa" is returned as "a")
  • Removes all the duplicate characters (for example "anakonda" becomes either "ankod" or "konda", depending on which occurrences of each duplicate character we choose to remove; I'll initially use the second case and later switch to the first one)
  • Returns a string with no duplicate characters as is (for example "the" is returned as "the")
It's fine if we forget some test cases; we can extend them later. Now, let's assume that the interviewer insists on actually implementing the tests. I'll use Python, because writing unit tests in Python is straightforward -- and I obviously like the language ;)

First we need a stub version of the function. Our function, called remDupl, will accept a string as input and return a copy of the input string without duplicate characters.
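In the original post the code is shown as screenshots; here is a minimal sketch of the stub, assuming the module is called remdupl.py:

    # remdupl.py -- stub version; the real logic comes later
    def remDupl(my_str):
        """Return a copy of my_str with all duplicate characters removed."""
        return my_str  # placeholder: no duplicates are removed yet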

Now we are ready to write our test.
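A test module along these lines covers the cases listed above (a sketch using the standard unittest module; the file name test_remdupl.py is my own):

    # test_remdupl.py -- unit tests for remDupl
    import unittest
    from remdupl import remDupl

    class TestRemDupl(unittest.TestCase):
        def test_remDupl(self):
            self.assertEqual(remDupl(''), '')               # empty string is valid input
            self.assertEqual(remDupl('a'), 'a')             # single character returned as is
            self.assertEqual(remDupl('aa'), 'a')            # two identical characters collapse to one
            self.assertEqual(remDupl('anakonda'), 'konda')  # duplicates removed, keeping the last occurrences
            self.assertEqual(remDupl('the'), 'the')         # no duplicates, returned as is

    if __name__ == '__main__':
        unittest.main()

Running python test_remdupl.py executes the suite.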

Our test is already useful. We can make sure that it fails and then write the main logic of our function until it passes the test.

Our test fails, as expected

The naive implementation

So far so good. It's time to start implementing our function. Since time is against us, the first step is to follow a naive approach. The first implementation might not be the most efficient or the cleanest, but it's important to get something working. If there's enough time we can improve it later; if not, we can still discuss the pros and cons of our naive implementation with the interviewer.
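The original naive code again lives in a screenshot; a sketch that reproduces the behaviour described here (each duplicate keeps only its last occurrence, so "anakonda" becomes "konda") can be built on count and replace:

    def remDupl(my_str):
        # For every character that occurs more than once, remove all but its
        # last occurrence. str.replace(old, new, count) replaces only the first
        # 'count' matches, so the final occurrence survives.
        for c in my_str:  # the loop iterates over the original string
            occurrences = my_str.count(c)
            if occurrences > 1:
                my_str = my_str.replace(c, '', occurrences - 1)
        return my_str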

First of all, does our function behave as expected?

Test result of the naive implementation
Looks good :)

Improving the code

What's the time complexity of our algorithm? We iterate through the whole string once, so it's essentially O(n), although the string operations inside the loop add their own cost. Even if it's hard to improve the time complexity, we can do better by writing our code in more idiomatic Python. Python offers a set data structure, which looks like a good candidate since all items in a set are unique by definition. Unfortunately, the items of a set have no particular order, so we cannot simply create a set out of our input string; we need a trick to preserve ordering. Combining a set with a generator expression and joining the result back into a string gives the expected output.
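One common way to express that trick (a sketch, not necessarily the exact code from the screenshot) is a set that remembers what we have already seen, filtered through a generator expression:

    def remDupl(my_str):
        # 'seen' records the characters encountered so far; the generator
        # expression lets a character through only the first time it appears.
        # seen.add(c) returns None, so the 'or' clause records the character
        # and still evaluates to a falsy value.
        seen = set()
        return ''.join(c for c in my_str if not (c in seen or seen.add(c)))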

Nice, let's test our changes.

The test fails: only the first occurrence of each character goes into the set


Hmmm, something is going wrong here. We need to update our test case, since the new code keeps only the first occurrence of each character (thus "anakonda" becomes "ankod" instead of the "konda" we got when replace was used).
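The change in the test module is a single assertion (continuing the sketch from above):

    # the set-based version keeps the first occurrence of each character
    self.assertEqual(remDupl('anakonda'), 'ankod')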

Now that we updated the test case, our code should pass the tests. Let's verify.

The new code passes all the tests
And we are done!

Idio(ma)tic?

The new version of our function is more compact and Pythonic, but does it really perform better than the first version? Since both algorithms are O(n), the details are in the data structures, and one way to find out is to use a profiler. So let's see how each version of our function performs when executed with 600,000 strings as input.
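A profiling session roughly like the following can show where the time goes. This is a sketch: the helper that builds the 600,000 input strings and the names remDupl_naive and remDupl_set are my own, since the post shows only the profiler output:

    import cProfile
    import random
    import string

    # assumption: both versions from above are kept side by side in remdupl.py,
    # renamed remDupl_naive (replace-based) and remDupl_set (set + generator)
    from remdupl import remDupl_naive, remDupl_set

    # hypothetical input: 600,000 short random lowercase strings
    words = [''.join(random.choice(string.ascii_lowercase) for _ in range(8))
             for _ in range(600000)]

    def run(func):
        for w in words:
            func(w)

    cProfile.run('run(remDupl_naive)')  # profile the first, replace-based version
    cProfile.run('run(remDupl_set)')    # profile the set + generator version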

Profiling shows that the naive implementation performs better
The second version is a step backward in terms of performance. Profiling shows that the generator expression and the join function have the biggest negative impact. Although list comprehensions and generator expressions are among Python's strengths and are well optimized, in this particular case the fact that we need an ordered string as the output does not help.
