Non-standard solutions to some technical problems

Posted on February 20, 2021 · Tagged with databases, logs, monitoring, analytics, postgresql, interviews

Intro

When I’m interviewing for contracts or for jobs, many times I get asked what are the hardest problems I’ve had to deal with. I’ll write below some of these problems as I remember them now.

Double-counting

Some years ago, I was working as a consultant in the analytics team of a top-10 Alexa website with hundreds of millions of views every month. I was tasked with finding a bug in the code that was counting page views.

The code was counting the number of views coming from different areas around the world. In order to summarize this vast amount of data, and bucket it into geographical regions, it would have to classify each city in the world to a certain region, and there were two main regions that we were interested in: Northern Hemisphere vs. Southern Hemisphere.

The bug itself would manifest in the totals row, for the percentages of views. Instead of having a grand total of 100%, we had 100.41%.

It was definitely a case of double-counting but nobody knew exactly where it was coming from.

The codebase was a fairly large 40k lines of code Perl written in a very arcane pre-Perl5 codingstyle. Tests were absent, and rewriting the code to be object-oriented in order to be able to tests different parts of the code in isolation was not an option given the size of the codebase. Checking all the code, and the arithmetic operations would have taken way too much time and effort.

Making changes to the code and re-running it on full or partial data was also prohibitive because there were hundreds of gigabytes of data to analyze in the full dataset.

After many days of looking at the code, and trying to understand where the problem was coming from, I realized that the only chance I had was to try to craft the simplest test possible.

Eventually I managed to craft that test: I wrote a program that would run the Perl code with a one-line log file, each time with a different city on the planet. Right after the Perl code was run, my program would check the reports generated to see if the percentages were correct.

The data is split up by a CSV mapping(a City→Hemisphere mapping). The planet was split into Northern/Southern hemispheres, which have cities allocated to them and some cities were placed in both the northern and southern CSV categories, resulting in counting the same numbers twice.

Running the code with one-line log lines, with just 1 city at a time, revealed that for some cities (on some islands in the Atlantic Ocean) the percentages were wrong(different than 100%), then grepping the codebase for those cities also revealed the presence of the CSV files in question, and that those cities were marked as being in both the Northern and the Southern hemisphere CSV files. After an e-mail thread where the suggestion of modifying the CSV file was discussed, the problem was solved.

In this case the bug was located in the data, and not the code itself.

Dashboard dependencies

When I was working a devops position in an online gaming company, I was handed a problem about dashboards being broken and reports in them not showing up anymore.

The way these were structured was this: Dashboards -→ Reports -→ Queries -→ Tables

So a dashboard had multiple reports, and reports were using queries, and in turn queries were accessing data from tables (all these relations we’ll call references or links later on).

The problem emerged after an upgrade of the visualization software involved (this upgrade was done 10 months before I had joined the company). Contacting the vendor support team resulted in an answer like: "Just upgrade to the next version" (even when presented with a full analysis of the problem). There were only two problems with that:

  • upgrading a running system with active users requires planning, approval, and sometimes having a replica to do the upgrade on in order not to affect production and users if something goes wrong during the upgrade

  • upgrading was exactly what broke the reports in the first place. so upgrading that system again could have caused even more reports being broken

I realized upgrading wouldn’t lead to a solution.

So I wrote code to dig into the metadata storage of the visualization software, reverse-engineered the undocumented structure of the metadata, copied the data over to an SQLite database (which I basically used as a graph database with nodes and edges) and came up with a GraphViz-based program that was able to track object dependencies across upgrades. This allowed me to then find the nodes that were pointed to by broken links in the dependency relations, and either re-create or update the objects in the current version of the reports in order for them to work properly (the changes performed were such that all references would point to valid objects).

At this point, visibility was increased, the entire hierarchy of dependencies was clear.

Having done all of this made it easy to solve the next task, which was to identify unused tables, by computing the in-degree of each such table.

Identifying WAL size generated by queries

When I was working as a DBA at a telecom company we had some spikes in the traffic of WAL logs in a PostgreSQL database that was replicated. This was caused by some queries which would generate a lot of WAL, and as a result there was more of it being transferred, it took longer for the replica to apply all the WAL and catch up with the master, and that was also causing a replication lag.

It turned out that I could actually use pg_waldump to get the transaction ids in each WAL file. In addition, I realized that if I monitored the queries that were running on the database servers using the pg_stat_activity catalog I could get the queries and their transaction ids. By joining these two pieces of information, I was able to find out how much WAL each query had generated, and then pick the outliers and report them back to the teams that had written them, in order to be refactored.

Conclusion

Sometimes seemingly hard bugs can be solved by crafting the right kind of test (which in turn shortens the cycle of trying out potential fixes). Sometimes they require crafting new tools that increase observability and visibility into the ways systems work.

If you liked this article and would like to discuss more about these topics, feel free to drop an e-mail at stefan.petrea@gmail.com.