Speed is a pillar for Graphistry engineering: the faster our graph visualizations and AI are, the faster analysts can untangle the relationships in their data, and the more they can achieve. Ironically, while the new extensions to PyGraphistry[ai] are designed to make analysts go faster by adding automatic graph AI capabilities (automatic feature engineering, …), the Python package import times slowed down to 10-20s. Worse, as our GPU server environments load 1-2 instances of the library per CPU core, system startup suffered heavy contention across CPU/IO/memory resources, and initialization slowed to minutes or even stalled out entirely. We wanted to get the time back down to 1-2s. Luckily, with just a couple of optimization sessions, we did just that. The trick was two classic performance techniques: lazy loading and (Tuna) import profiling.
Lazy loading Python dependencies
Python coding standards recommend keeping all of your imports up top. That makes reading code easier and also aids fail-fast behavior for bad imports. But… Individual PyData ecosystem dependencies are increasingly 5s+ monsters and do not support targeted pay-as-you-go sub imports. Running all the imports that power a modern GPU-accelerated autoML package adds up! So, we instead now defer bigger packages to only load upon first use. For many users and many packages, that would be never. This optimization is called lazy loading.
A few culprits stood out among PyGraphistry[ai]’s recent new optional PyData AI dependencies, like SentenceTransformers for our new automatic feature engineering pipeline and UMAP for our new automatic clustering pipeline. Deferring a few obvious new imports like these would recover much of the import time. While other optimizations may have been possible, lazy loading some of the bigger new ones was an attractive target because of the nice mix of easy-to-do and a big predictable payoff.
We took a direct approach to lazy loading by rewriting the heavy imports as:
```python
import torch

print(torch.cuda.is_available(), 'GPUs!')
```
```python
# normal imports
...

def lazy_torch():
    import torch
    return torch

def has_gpus():
    torch = lazy_torch()
    print(torch.cuda.is_available(), 'GPUs!')
```
With this change, the torch import does not run during import graphistry, and instead only triggers when has_gpus gets called. Subsequent calls are essentially free because Python caches imported modules. You might notice the code adds an explicit function call lazy_torch() instead of directly calling import: centralizing the call is a gift for our future selves in case we ever want to more easily see all the calls to any particular import.
For scenarios like static type checking with MyPy, we often still want to keep imported identifiers available in the global namespace, which conflicts with our goal of delaying the import statements. With modern Python, that’s totally fine!
The trick is to use the variable TYPE_CHECKING, which lets static tools like MyPy run the imports during CI while skipping them at run time:
```python
from typing import Any, TYPE_CHECKING

if TYPE_CHECKING:
    from gremlin_python.driver.client import Client
else:
    Client = Any

def graphistry_gremlin(c: Client):
    ...
```
In the code snippet, the type Client is only needed during type checking. The conditional will run the import then, but not at actual runtime. Actual uses of Client within functions can use lazily loaded versions of the module through the original lazy loading pattern.
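As a stand-alone illustration of combining both patterns, here is a sketch using the stdlib's sqlite3.Connection in place of the gremlin client (since gremlin_python may not be installed):

```python
from typing import Any, TYPE_CHECKING

if TYPE_CHECKING:
    # evaluated only by static tools like MyPy, never at runtime
    from sqlite3 import Connection
else:
    Connection = Any  # runtime placeholder keeps the annotation valid

def run_query(conn: Connection, sql: str):
    # the real module loads lazily, on first call
    import sqlite3
    assert isinstance(conn, sqlite3.Connection)
    return conn.execute(sql).fetchall()
```

At type-check time MyPy sees the real Connection class; at runtime the annotation is just Any, and the import cost is paid only if run_query is ever called.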
The good news is that import time dropped to 6-7s, a 2-3X improvement!
The bad news is… 6-7s is still way slower than we expect out of great software.
You get what you measure: Import profiling with Python & Tuna
The old adage “You get what you measure (so be careful what you measure)” is important for optimization. The first step to optimization is measurement, so we know where to bother spending our time… and where not to. In this case, we knew going in that a few of the AI imports were taking most of the time, but after knocking those out, it was unclear where to go next.
Instead of trying random optimizations or pulling out other imports… we quickly measured the import time:
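The measurement itself is one flag on the interpreter (the log filename here is our choice, not prescribed):

```shell
# -X importtime prints a per-module import timing tree to stderr
python -X importtime -c "import graphistry" 2> import.log
```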
To make the telemetry (much) easier to understand, we ran the logs through the Tuna visualizer:
You can see the Tuna visualization in Figure 2 above. A new set of 3 dependencies visually popped out as taking 75% of the remaining time. Low-hanging fruit is the best! We weren’t surprised to see half of that time was in RAPIDS.ai (GPU data frames) and Dask (distributed data frames), but as we had a few other candidates we were suspicious of, profiling saved us time by identifying those 2 in particular. Much more surprising to us, half of the remaining import time was colorbrewer: we did not expect a simple library of colorblind-safe palettes to cause such a slowdown!
After optimizing those 3, we successfully reduced the import time down to a much more reasonable 100-200ms in typical conditions, and 1-2s for a fully cold start.
When used by the Graphistry visual graph AI server, with an instance per core, the server start time went down from minutes to seconds. Just a couple of hacking sessions, and PyGraphistry[ai] is ready for server use as well. Tuna shows that we can easily shave off another 10%+ by lazy-loading requests, but with 75% of the remaining time now just Pandas and Numpy, we’ve hit diminishing returns. Instead, we might want to dig into contributing patches to our upstream dependencies or looking at other types of optimizations.
Changing just 6 imports for ~100X speedups in the typical case, and 10X in the worst case, is pretty good for just a couple of sessions!
As your Python startup times creep up, lazy loading is a powerful design pattern that can be easily and predictably applied. The core pattern itself is fairly simple, and you can go in multiple directions with it, such as our TYPE_CHECKING scenario.
Furthermore, measurement is one of our most important tools when optimizing. The combination of python -X importtime -c "import graphistry" for telemetry and Tuna for visualizing it is an easy way to go. Half the work was obvious, but with the combination of importtime + Tuna, we immediately spotted the other half to target as well.
Optimization is a tricky beast. Graphistry makes analysts more productive at finding connections in their data, so our steady rollout of automatic AI algorithms is key to helping analysts. At the same time, we can’t let those tools become too heavy themselves, or they’ve defeated their purpose. While many of our tool optimizations are tricky, combining irregular data-parallel algorithms with micro-optimizations so we can stay within 10ms interactivity budgets, they often boil down to simple steps. Lazy loading and measurement are two of our common tools, and the recent mini-sprint on cleaning up the new PyGraphistry[AI] imports is a great example of both.
import graphistry now runs in 100-200ms for typical use!