Monday, September 18, 2006

violent overactive debugging -- c++ STL nightmares

this may be considered a draft:

ok, let me see if I can come to some conclusion about this half-nightmare bug-hunt that I came across recently

as some of you know, one of my tasks at work consists of maintaining a simulator of kilo-instruction processors. The other day, while playing with a new implementation of load-store queues, I came across this funny segfault. My program makes extensive use of the c++ STL, and the particular module in question relies heavily on list<>'s and multimaps. At some point the program was calling equal_range() to traverse a set of instructions with the same address. If you think about this you'll see that a multimap is the perfect structure to simulate a memory disambiguation queue. equal_range() returns a pair of iterators. In pseudo-c++ it looks like this:

pair <> it_pair = disamb_queue.equal_range(load->address())
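A slightly more complete sketch of that pseudo-code follows. The Instr struct, the DisambQueue alias and the same_address() helper are hypothetical stand-ins made up for illustration, not the simulator's real types, and the code uses `auto`, which needs a C++11 compiler:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for a simulated instruction (not the real simulator type).
struct Instr {
    std::uint64_t addr;  // effective memory address
    int seq;             // program-order sequence number
    std::uint64_t address() const { return addr; }
};

// Hypothetical queue type: effective address -> in-flight instruction.
using DisambQueue = std::multimap<std::uint64_t, Instr*>;

// Collect the sequence numbers of every queued instruction that touches
// the same address as `load`.
std::vector<int> same_address(const DisambQueue& q, const Instr& load) {
    std::vector<int> out;
    // equal_range() returns [first, second): all entries whose key equals the address.
    auto range = q.equal_range(load.address());
    for (auto it = range.first; it != range.second; ++it) {
        assert(it->first == load.address());
        out.push_back(it->second->seq);
    }
    return out;
}
```

If no instruction with that address is queued, both iterators of the pair are equal and the loop body never runs.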

what happened is that traversing the resulting iterator range would result in a segfault, as one instruction would have an invalid memory address. Later analysis showed that the memory address of the instruction (which was 0x3 btw) was obtained from the number of elements in the SQ, a local variable present in the store queue. Well, looks like the typical horrible memory management problem. Unfortunately, analysis within gdb was difficult, as the debugger refused to de-reference the iterator structure while complaining that it had an incomplete type. So I had to actually guess the distance of the pair of elements from the node head, which turned out to be 24 bytes. But let's forget about that.

Given that my debugger experience was getting very complicated, I implemented a debugging function that just runs equal_range() and traverses the elements of the structure. This allowed me to track down the moment at which the program started to break. Apparently this was happening very close to the end, but why exactly was still a mystery.
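Such a debugging function might look roughly like the sketch below. This is not the original code: Instr, DisambQueue and count_bad_entries() are hypothetical names, and the consistency rule (an entry's key must match the address stored in the instruction) is an assumption about what is worth checking:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a simulated instruction.
struct Instr {
    std::uint64_t addr;
    std::uint64_t address() const { return addr; }
};

// Hypothetical queue type: effective address -> in-flight instruction.
using DisambQueue = std::multimap<std::uint64_t, Instr*>;

// Walk every entry queued under `addr` and count the ones that look corrupt:
// a null pointer, or an instruction whose own address disagrees with its key.
// Note: reading through a dangling pointer is undefined behaviour, so a check
// like this only catches corruption that happens to be visible.
std::size_t count_bad_entries(const DisambQueue& q, std::uint64_t addr) {
    std::size_t bad = 0;
    auto range = q.equal_range(addr);
    for (auto it = range.first; it != range.second; ++it) {
        assert(it->first == addr);  // equal_range only yields matching keys
        const Instr* in = it->second;
        if (in == nullptr || in->address() != it->first) {
            ++bad;
        }
    }
    return bad;
}
```

Calling something like this once per simulated cycle narrows down the cycle in which the queue first becomes inconsistent.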

I knew this must be a memory management problem so I turned to glibc malloc hooks. In linux, glibc memory management can be tuned by setting an environment variable called MALLOC_CHECK_. You can set this variable to three values: 0, 1 and 2. 0 instructs glibc to ignore the error, 1 tells glibc to print a diagnostic and continue, and 2 instructs glibc to immediately call abort() when a problem happens. I had used this technique before with success but this time I ran into problems. Running with MALLOC_CHECK_=1 reported a double free(), but within the debugger this was not reproducible, so I could not find the moment of the double free(). And yes, the debugger was running with malloc hooks, at least it printed the messages in the terminal. Using the abort() method also didn't work for some reason. abort() seemed not to be called. I particularly like the abort() method - when it works - because it allows you to just break at abort() within the debugger and print a backtrace of what was happening at that moment. But anyway, it didn't work this time.

My despair was pretty large at this point. Debugging STL is something nobody in their right mind would do (except maybe if you are an STL developer). So I went one step further and tried out a tool that I had avoided for too long: Valgrind. I had always thought that valgrind was something similar to electric fence, ie, a library that substitutes malloc/calloc/free and changes alignments to give you more information. I was very wrong! Valgrind is actually a cpu emulator that tracks validity of all memory data and addresses. This allows it to detect a number of common errors: reads of free'd data, using non-initialized data, off-by-one errors, etc. The tool is beautiful; the only problem may be its slowness - around 20 to 30 times slower, which is normal if you consider it is an emulator. Fortunately my error appeared within the first 20 seconds, so I could run valgrind and have the result in a couple of minutes. Valgrind is cool, but interpreting its error messages is not trivial at first, and it has the problem that even though you can track the error to the exact line in the source code, there is no way to know when exactly it happened. Thus I cannot just reproduce the error condition inside the debugger. CPU simulators are in general cycle based, and calls to the same functions are happening continuously.

What Valgrind did tell me was basically that equal_range() was reading data that had been freed earlier in a different function. Valgrind actually tells you which function free'd the data. This feature is really critical! Thanks to this information I was now able to start the traditional method of fprintf() based debugging :) In the end I found the error in a mismatch of insertions/deletions caused by a subtle difference between two different simulators I am maintaining (a subtlety I had never come across before) and now I have been able to fix this error. This story may be the daily bread of many developers, but I wanted to post it because I was fascinated by the number of different debugging techniques I had to use to actually solve this problem. There were at least 4 techniques involved (gdb, MALLOC_CHECK_, valgrind, fprintf()) and every technique allowed me to go just one step further. fascinating...
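The actual fix is specific to the simulators, so it is not shown here. As a generic illustration of keeping insertions and deletions paired, the sketch below removes an instruction's entry from the queue before its memory is released, so equal_range() can never hand back a freed instruction. Instr, DisambQueue and remove_entry() are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a simulated instruction.
struct Instr {
    std::uint64_t addr;
    std::uint64_t address() const { return addr; }
};

// Hypothetical queue type: effective address -> in-flight instruction.
using DisambQueue = std::multimap<std::uint64_t, Instr*>;

// Erase the queue entry that points at `instr`. Call this before the
// instruction is freed. Returns false if the instruction was not queued,
// which would itself hint at an insertion/deletion mismatch.
bool remove_entry(DisambQueue& q, Instr* instr) {
    assert(instr != nullptr);
    auto range = q.equal_range(instr->address());
    for (auto it = range.first; it != range.second; ++it) {
        if (it->second == instr) {
            q.erase(it);  // erasing invalidates `it`, so return right away
            return true;
        }
    }
    return false;
}
```

Erasing by iterator rather than by key matters here: q.erase(key) would drop every instruction queued under that address, not just the one being retired.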

After I solved this bug today I started to think about writing a book about debugging c++. of course I'll never do this, but being able to use debugging tools can really help a lot. As an (ex-)teacher I have always been disappointed about how much debugging tools are neglected by course programs and by the students themselves. but this is a topic I'll not touch here :)

good night!

Monday, September 11, 2006

internet adventures part II

hi there!

well.. I finally have some good news: internet is working and this time completely.

after the last post, I spent the morning searching for information on what had happened to me and I found a more advanced forum of people who had been able to go past the "set adsl open dmt" issue. The funny part, I found out, is that most of those people had also observed the large data burst problem (though they referred to it as the problem of "sites with large forms"). Someone suggested that an MTU of 576 should work best with the PPP mode we are using. He apparently found that information on wikipedia. I tried it, and in fact some pages started to work (gmail for instance). But not all.

So finally I tried to configure for the second time the Comtrend ADSL modem (software version A101-220TLF-C35) of telefonica that a friend lent me. I searched for the user/pass (which is 1234/1234) and entered the parameters. At first it didn't work but with some additional modifications to the standard configuration it did. These changes consisted of:

1) enable G.DMT modulation together with ADSL2+ and T1.413
2) select "outer pair" instead of "inner pair" as the phone line pair

the first change was pretty obvious as that is what I had been told with the previous modem. However, the second was more a matter of luck. Telefonica distributes its modems with a couple of microfilters. However my installation at home, which was done about 5 years ago, does not use microfilters and instead relies on a splitter at the entry point. I don't know if there is any relation between these but this fact made me change to "outer pair". maybe I should take a look at this issue more closely :)

well, the important thing is that I can work without problems now. Some day it seems that ya.com will send the new modem and at that time I'll have some new adventures to tell :)

ok.. hasta la vista then!
-- izzu

Friday, September 08, 2006

my internet adventures

Hi there!

It's reasonable to think that complexity increases as technology advances, but the adventures I have been having with my ISP as of late really surpass my capacity of understanding.

Let me tell you this little story. First some background. I'm a DSL user located in barcelona, spain. Our service provider is ya.com and it has been for about 5 years now. We've been very happy all this time, with minimal downtimes and continuous increases in throughput (we started at 256kbps and are now at 4mbps).

Last year (2005, that is) ya.com bought a fiber network called albura. Not many people talked about this but it's basically the beginning of my small horror story. Thanks to the albura network (AFAIK) ya.com has been able to offer 20mb download rates using the so-called adsl2+ technology. Ya.com clients from 2001-2002 did not use this network and instead relied on lines that were leased from Telefonica, spain's main operator. This circumstance prevented our migration to adsl2+. Thus we were still using adsl2@4mbit speed in July this year. and btw our modem router was the ancient 3com 812 officeconnect which does not support adsl2+ anyway.

In july 2006 I started observing strange behaviour of my modem. In general it would work, but periodically the modem would stop working for some time, apparently rebooting itself (although this I did not notice). It mostly worked so I did not bother too much. However, on July 26th (more or less) it stopped working completely. The modem was still synchronized but there was no way to send or receive packets. Since I was on a trip I could not check these facts and instead thought that I'd be able to easily correct the incident after returning home.

But once I was back home things just got worse. I could not make the modem work despite my best efforts, and worse, the modem suddenly lost synchronization and stayed that way. At this point (I guess Aug 1st or 2nd) I contacted tech support, who provided me with some new configuration parameters (which for some reason nobody had told us we had to change). But this did not solve the problem. We were told to wait for technical staff to contact us about our problem.

Since I didn't want to wait I made a small web search and imagine what I found. Apparently ya.com was in the process of migrating all 2001-2002 users to the albura network, but something went wrong and lots of people in my area were having the same problem. What followed was a couple of weeks trying to contact tech support without luck. Some three weeks later (!) a technician finally contacted me and told me that 1) ya.com people had checked the remote modem station and that 2) a new router was needed for the adsl2+ technology, which would be sent shortly after. Then he told me that even though my modem was not adequate we could use a patch to make it work. This patch consisted of telnetting into the modem and issuing the commands:

set adsl open dmt
set adsl reset

Back at home I tried, and indeed it resynchronized after a minute. Interestingly I had to reconfigure the modem to act as default gateway, something which I had never modified but suddenly was not active anymore. Anyway.. it was working and after 4 weeks I was happy again. But now is when the most perplexing part starts, a new problem that I still haven't been able to solve at this point.

With the modem synchronized everything seemed to work well. Everything except gmail. For some unknown reason, after logging into gmail I am completely limited. Emails can only occasionally be viewed, sending is impossible and chat also doesn't work. And this happens on all computers at home. ALL! So it must be something in the network. But what? I have checked that both TCP and UDP work and I even played with modem parameters such as the MTU and other parameters that I can't remember. It can't be a software problem since the same computers that exhibit this problem work perfectly in a different network. I have tried to google for similar experiences, but without luck so far. It's so perplexing that I don't even know if I would be able to explain the problem correctly to a technician.

I think that this maybe has some relationship to the technologies used by gmail. I tried to find out something. GMail uses mostly AJAX to minimize transfers between client and server. AJAX seems to contain three components: XMLHttpRequest+JavaScript+XHTML (simplified). Individually searching for problems with these technologies also has yielded no answer as to why I may have problems accessing the site from my network. Maybe it is google that doesn't like our network address?? Or could it be that the "patch" does not provide full internet service??

As said, I'm perplexed. Luckily I do most interneting from work, where I do not have this problem, but it is still annoying getting home and not being able to access gmail other than through the POP3 interface.

So far my horror story :) If you somehow came here and have a suggestion I'd really like to hear it.

In the meantime I'll get back to some reading/hacking :)
ByeBye! -- izzu

UPDATE:

What you just read is what I wrote last night. But I couldn't submit it to blogspot. Imagine why. However, the incident has allowed me to confirm a previous hypothesis that I had discarded: The problem with my DSL line lies in the fact that the ATM/ADSL network cannot support large bursts of outgoing data. This is what I thought when I started to play with the MTU parameters, but my tests didn't work back then. Thus I had since discarded this idea. But yesterday, after seeing that I could not submit the post, I tried to upload it to a server that I can access from work so that I could send it today (and this is what I'm doing now, btw). To my surprise, the sftp command stalled. When this happened I was reminded of my previous hypothesis and decided to test it. Luckily sftp has a switch (-B) that allows you to specify the size of the buffer used when transferring files. I tried with -B 256 and it also stalled, but it went further. Then I tried with '-B 16' and I was able to transfer the document. After noticing this I have been playing a little with the UBR and CBR parameters of the ATM line but I have had no success. I think I will try to find some info on this on the web or give a second chance to an adsl2+ router that a friend lent me. Anyway, as you can imagine, the worst of all this is not the internet problem itself but the nervousness of days and days of tuning router parameters without success.

Well, back to work for today
-- izzu