Monday, September 18, 2006

violent overactive debugging -- c++ STL nightmares

this may be considered a draft:

ok, let me see if I can come to some conclusion about this half-nightmare bug-hunt that I came across recently

as some of you know, one of my tasks at work consists of maintaining a simulator of kilo-instruction processors. The other day, while playing with a new implementation of load-store queues, I came across this funny segfault. My program makes extensive use of the C++ STL, and the module in question relies heavily on lists and multimaps. At some point the program was calling equal_range() to traverse the set of instructions with the same address. If you think about it, you'll see that a multimap is the perfect structure to simulate a memory disambiguation queue. equal_range() returns a pair of iterators. In pseudo-c++ it looks like this:

pair <> it_pair = disamb_queue.equal_range(load->address())
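To make the idea a bit more concrete, here is a minimal sketch of a multimap keyed by memory address and traversed with equal_range(). The names Instr, DisambQueue and count_same_address are made up for illustration; the real simulator classes are not shown in this post and certainly look different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a simulator instruction (an assumption, not the real class).
struct Instr {
    std::uint64_t addr;  // memory address touched by the instruction
    int seq;             // program-order sequence number
    std::uint64_t address() const { return addr; }
};

// Hypothetical queue type: memory address -> in-flight instructions using that address.
typedef std::multimap<std::uint64_t, Instr*> DisambQueue;

// Count the queued instructions that share the load's address.
std::size_t count_same_address(const DisambQueue& q, const Instr* load) {
    assert(load != nullptr);
    // equal_range() returns [first, second): all entries whose key equals the address.
    auto range = q.equal_range(load->address());
    std::size_t n = 0;
    for (auto it = range.first; it != range.second; ++it) {
        // it->first is the address key, it->second the Instr* stored under it.
        ++n;
    }
    return n;
}
```

The loop body is where a real simulator would compare ages, forward store data and so on; counting is only there to keep the sketch short.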

what happened is that traversing the resulting iterator range would end in a segfault, as one instruction would have an invalid memory address. Later analysis showed that the memory address of the instruction (which was 0x3 btw) came from the number of elements in the SQ, a local variable present in the store queue. Well, looks like the typical horrible memory management problem. Unfortunately, analysis within gdb was difficult, as the debugger refused to dereference the iterator structure, complaining that it had an incomplete type. So I had to actually guess the offset of the pair of elements from the node head, which turned out to be 24 bytes. But let's forget about that.

Given that my debugger experience was getting very complicated, I implemented a debugging function that just runs equal_range() and traverses the elements of the structure. This allowed me to track down the moment at which the program started to break. Apparently this was happening very close to the end, but why exactly was still a mystery.
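A debugging function of this kind could look roughly like the sketch below. This is not the original code: Instr, DisambQueue, check_range and the cycle parameter are hypothetical names, and the consistency test (does the instruction still report the address it is filed under?) is just one plausible check. Note that the check reads through the stored pointer, so if the instruction has already been freed the function itself may crash or print garbage, which is exactly how it points at the moment things go wrong.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>

// Hypothetical stand-ins for the simulator types (assumptions, not the real classes).
struct Instr {
    std::uint64_t addr;
    std::uint64_t address() const { return addr; }
};
typedef std::multimap<std::uint64_t, Instr*> DisambQueue;

// Walk all entries filed under `addr` and report those that look inconsistent.
// Returns the number of suspicious entries found.
int check_range(const DisambQueue& q, std::uint64_t addr, long cycle) {
    int bad = 0;
    auto range = q.equal_range(addr);
    for (auto it = range.first; it != range.second; ++it) {
        const Instr* in = it->second;
        // A live instruction must still report the address it was inserted under.
        if (in == nullptr || in->address() != it->first) {
            std::fprintf(stderr, "cycle %ld: bad entry under key %#llx\n",
                         cycle, static_cast<unsigned long long>(it->first));
            ++bad;
        }
    }
    return bad;
}

// Convenience wrapper for debug builds: stop at the first cycle where the range is wrong.
void assert_range_ok(const DisambQueue& q, std::uint64_t addr, long cycle) {
    assert(check_range(q, addr, cycle) == 0);
}
```

Calling something like this once per simulated cycle narrows the problem down to the first cycle in which the queue contents stop making sense.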

I knew this must be a memory management problem so I turned to glibc malloc hooks. On Linux, glibc memory management can be tuned by setting an environment variable called MALLOC_CHECK_. You can set this variable to three values: 0, 1 and 2. 0 instructs glibc to ignore the error, 1 tells glibc to print a diagnostic and continue, and 2 instructs glibc to immediately call abort() when a problem happens. I had used this technique before with success, but this time I ran into problems. Running with MALLOC_CHECK_=1 reported a double free(), but within the debugger this was not reproducible, so I could not find the moment of the double free(). And yes, the debugger was running with malloc hooks, at least it printed the messages in the terminal. Using the abort() method also didn't work for some reason: abort() seemed not to be called. I particularly like the abort() method - when it works - because it allows you to just break at abort() within the debugger and print a backtrace of what was happening at that moment. But anyway, it didn't work this time.

My despair was pretty large at this point. Debugging the STL is something nobody in their right mind would do (except maybe an STL developer). So I went one step further and tried out a tool that I had avoided for too long: Valgrind. I had always thought that valgrind was something similar to electric fence, i.e., a library that replaces malloc/calloc/free and changes alignments to give you more information. I was very wrong! Valgrind is actually a cpu emulator that tracks the validity of all memory data and addresses. This allows it to detect a number of common errors: reads of free'd data, use of uninitialized data, off-by-one errors, etc. The tool is beautiful; the only problem may be its slowness - around 20 to 30 times slower than a native run, which is normal if you consider that it is an emulator. Fortunately my error appeared within the first 20 seconds, so I could run valgrind and have the result in a couple of minutes. Valgrind is cool, but interpreting its error messages is not trivial at first, and it has the problem that even though you can track the error to the exact line in the source code, there is no way to know when exactly it happened. Thus I cannot just reproduce the error condition inside the debugger. CPU simulators are in general cycle based, and the same functions are called continuously.

What Valgrind did tell me was basically that equal_range() was reading data that had been freed earlier in a different function. Valgrind actually tells you which function free'd the data. This feature is really critical! Thanks to this information I was now able to start the traditional method of fprintf() based debugging :) In the end I found the error in a mismatch of insertions/deletions, caused by a subtle difference between the two simulators I am maintaining (a subtlety I had never come across before), and now I have been able to fix it. This story may be the daily bread of many developers, but I wanted to post it because I was fascinated by the number of different debugging techniques I had to use to actually solve this problem. There were at least 4 techniques involved (gdb, MALLOC_CHECK_, valgrind, fprintf()) and every technique allowed me to go just one step further. fascinating...
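The actual fix is specific to my simulators and not shown here, but the general shape of the problem is easy to sketch: if an instruction is freed while the multimap still holds a pointer to it, the next equal_range() traversal reads freed memory. One way to keep insertions and deletions in step is to always erase the queue entry before freeing the instruction. The names below (Instr, DisambQueue, retire) are hypothetical and the ownership model is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins for the simulator types (assumptions, not the real classes).
struct Instr {
    std::uint64_t addr;
    std::uint64_t address() const { return addr; }
};
typedef std::multimap<std::uint64_t, Instr*> DisambQueue;

// Remove exactly the entry that points at `in`, then free the instruction.
// Returns true if the instruction was queued (and has now been deleted).
bool retire(DisambQueue& q, Instr* in) {
    assert(in != nullptr);
    auto range = q.equal_range(in->address());
    for (auto it = range.first; it != range.second; ++it) {
        if (it->second == in) {
            q.erase(it);  // drop the queue's pointer first ...
            delete in;    // ... and only then free the instruction
            return true;  // `it` is invalid now, so leave the loop immediately
        }
    }
    return false;  // not queued: nothing erased, nothing freed
}
```

Several instructions can share one address, so erasing by key alone would remove too much; that is why the sketch compares the stored pointer inside the range.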

After I solved this bug today I started to think about writing a book about debugging c++. Of course I'll never do this, but being able to use debugging tools can really help a lot. As an (ex-)teacher I have always been disappointed by how much debugging tools are neglected by course programs and by the students themselves. But this is a topic I'll not touch here :)

good night!
