In late January, I started working on a new research project. Without getting into too much detail, it involves network analysis on blockchain-based systems. My opinion on blockchain tech has not noticeably improved.
After some initial results, my advisor and I decided to target ASONAM 2019 for publication, which takes submissions in mid-to-late April. The exact, final deadline fell on the 28th this year.
I didn't make it.
I would like to briefly explore the reasons that I missed the deadline. Like prior deadlines I've missed, there is some element of research being simply unpredictable. However, there are larger errors that I made that would be valuable to not repeat.
I Was Indecisive
The first, and in my opinion largest, mistake boils down to indecision. While the project started in late January, I never committed to a single research question. Rather, I continued with various bits of exploratory analysis up until very shortly before the deadline.
While much of this exploratory analysis holds small bits of value, it doesn't come together as a cohesive whole. It isn't a paper. I knew this, experientially, from the project I spent most of last year on. In that project, I did a lot of initial analysis that never ended up in the paper despite informing my understanding of the problem space. Some of that will be going back in as I finish dealing with revisions for journal publication in the next couple of weeks. Despite knowing that much of my exploratory work would find its final resting place on the cutting room floor, I persisted with it up until about 12 days before the submission deadline.
Some of these results were quite strong, which gave me hope that I'd still be able to put together something compelling. One of our key results is that prior network analysis had missed a key behavior native to Bitcoin (one-time-use change addresses), which skews the "user" network substantially unless corrected for. Had the small extension I spent several of those 12 days on panned out, I might be writing a different post. However, it didn't, and I'm writing this post because those results were not enough, and I didn't commit to this idea until it was too late to fully realize it.
I Was Passive
As I worked through my analysis and began forming the project, it became clear (to me) that the initial direction that I'd worked out with Dr. Thai wasn't going to be viable. There were several alternatives, and earlier in the semester I suggested each of them in turn to Dr. Thai in our meetings.
However, I didn't do so forcefully. I'm saying now that the direction was inviable, but in early March I was not nearly so clear. I danced around the point, and when she pushed back against it I wilted. Had I been clear, I could easily have spent six to eight weeks fleshing out one of these more promising alternatives. It may still not have panned out, but it at least wouldn't have felt like as much of a waste.
I Did Not Understand the Limitations of My Tools
Over the past year, I have spent a lot of time working within the pandas ecosystem. It's been great! Pandas is a great library for many tasks, with a flexible API that has allowed me to do a lot of analysis much more quickly than I would've been able to otherwise—especially when paired with plotnine to quickly generate complex visualizations. However, pandas has a bit of a problem with large data.
Specifically: pandas' memory usage is highly variable, difficult to predict, and impossible to control. An operation may have minimal memory overhead and take less than a second to compute, but a small modification to it may instead take hours to run and result in a deep copy of some or all of the data. When the first copy of the data clocks in at 120GB, a deep copy automatically slows things to a crawl, and in my case it very quickly led to OOMing the server. The most common culprit was the .groupby(...) method, though I had issues with some chained aggregations via .apply(...) as well. Unfortunately, .groupby(...) is a fundamental operation necessary for my work, so many of my attempts to finalize results in the final days before the deadline simply fell apart.
Long nights and wasted hours could've been recovered if I'd realized the cause of these memory issues earlier. While I'd encountered performance issues with pandas before, I had largely attributed them to hitting slow paths in what is ultimately a Python library. During this project, I stumbled upon this post by pandas creator Wes McKinney that hits on the reasons for many of the issues I've faced. As useful as pandas is, it became clear to me that it currently isn't going to be a viable option for this particular analysis.
Not that I'm giving up on pandas. In fact, I still use it heavily. Rather, I am now better equipped to identify which problems it is ill-suited for. This one in particular has (a) a large memory footprint, and (b) a heavy reliance on groupby(...) operations for reasons intrinsic to the data. The combination of these two means that pandas is simply not the right tool for it.
Ultimately, I ended up rewriting many of these bits of analysis in Rust and doing the aggregations manually, then loading and plotting the results. These simple Rust programs were not too difficult to write thanks to the hdf5 and csv libraries, and even with repeated data loading they are substantially faster than my Python/pandas code. This let me complete part of my analysis, but ultimately I lost too much time struggling with MemoryErrors to be able to complete all of it.
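For a sense of what "doing the aggregations manually" can look like, here is a minimal sketch rather than my actual code. It assumes a two-column key,value text input with a header row, and the names aggregate_by_key and GroupStats are made up for illustration. It uses only the standard library to stay self-contained, where the real programs leaned on the hdf5 and csv libraries. The point is that memory grows with the number of distinct keys, not the number of rows, because each row is folded into a running total and then dropped.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Running totals for one group key (hypothetical type for this sketch).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct GroupStats {
    pub count: u64,
    pub sum: f64,
}

/// Streams `key,value` lines from `reader` and accumulates a per-key
/// count and sum, a hand-rolled stand-in for a group-by followed by
/// count/sum. The first line is assumed to be a header and is skipped.
pub fn aggregate_by_key<R: BufRead>(reader: R) -> io::Result<HashMap<String, GroupStats>> {
    let mut groups: HashMap<String, GroupStats> = HashMap::new();

    for (i, line) in reader.lines().enumerate() {
        let line = line?;
        // Skip the header row and any blank lines.
        if i == 0 || line.trim().is_empty() {
            continue;
        }

        // Split into exactly two fields: the group key and the value.
        let mut fields = line.splitn(2, ',');
        let (key, raw_value) = match (fields.next(), fields.next()) {
            (Some(k), Some(v)) => (k.trim(), v.trim()),
            _ => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("line {}: expected `key,value`", i + 1),
                ))
            }
        };

        let value: f64 = raw_value.parse().map_err(|e| {
            io::Error::new(io::ErrorKind::InvalidData, format!("line {}: {}", i + 1, e))
        })?;

        // Fold the row into its group's running totals; the row itself
        // is not retained once this iteration ends.
        let stats = groups.entry(key.to_string()).or_default();
        stats.count += 1;
        stats.sum += value;
    }

    Ok(groups)
}
```

Calling aggregate_by_key on a BufReader wrapped around a file gives back a map that can be written out as a small table and plotted from Python as before.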
Despite this failure, I'm not particularly upset. I am frustrated, but am trying to channel this frustration into dealing with the problems I faced productively. I am particularly glad that I have had an excuse to basically ignore research work for the past week, between grading exams and preparing to teach a summer class. It has given me time to reflect on the factors that led to this failure and realize that—even though it is my fault—it is something that I can learn from and improve on subsequent papers.