Graphistry 2.34 brings major speed and scale improvements for visually exploring large graphs. The new File API is especially exciting for us: the current release uses it for faster Jupyter/Python plotting, and the next release will build some long-awaited UI features on top of it. Many enterprise users will appreciate the improved Red Hat and vGPU support. The biggest news of all, however, is that we have begun rolling out the next phase of multi-GPU support: acceleration and scale. This includes a RAPIDS 0.17 upgrade (BlazingSQL, cudf dataframes, cuGraph, cuML, …), a Dask GPU cluster launched to automatically expose all your GPUs, and the first multi-GPU-capable Graphistry internal APIs.
If you’re new to Graphistry, you can explore relationships and correlations in your data for free from your current data science notebook or web environment via a Graphistry Hub account, or one-click launch your own Graphistry server into your private AWS/Azure account for working with private data.
For a deeper dive into Graphistry 2.34.14, check the official change log. The release is already available on our enterprise portal and is going through cloud marketplace approvals.
File REST API: 2-10X Faster notebook sessions
The File API is the building block for new v2.34 and upcoming features that separate structured data file uploads from the visualizations that use them. Notebook users can benefit immediately: the File API delivers quick speedups when making different visualizations over the same data. Consider the following notebook session:
import pandas as pd
import graphistry

### Graph 1
df = pd.read_csv('./800K_events.csv.gz')
g = graphistry.edges(df, 'user', 'ip').bind(point_color='col1').name('mygraph')
g.plot(as_files=True)  # uploads `df` as file `"mygraph edges"` and plots

### Graph 2
graphistry.edges(df, 'user', 'address').plot(as_files=True)
# detects reuse of `df`: skips the upload and plots on the already-uploaded data!
By adding the argument as_files=True to calls to plot(), the PyGraphistry Python client uses the server’s new File REST APIs. It checks whether the data’s hash matches a dataset already uploaded in the current session, and if so, skips that part of the upload; otherwise, it uploads a new file for you. A nodes table and an edges table count as two separate files, so if you modify only one table between visualizations, PyGraphistry will be smart enough to reuse the other.
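The memoization idea can be sketched in plain Python. This is an illustrative stand-in, not PyGraphistry’s actual implementation: the helper and cache names below are hypothetical, and the real client talks to the server’s File REST API rather than returning local ids.

```python
import hashlib

_upload_cache = {}  # content hash -> file id returned by a prior upload (per session)

def upload_file(table_bytes: bytes) -> str:
    """Upload a table's bytes once per unique content; reuse the cached id after."""
    key = hashlib.sha256(table_bytes).hexdigest()
    if key in _upload_cache:
        return _upload_cache[key]       # same bytes already on server: skip re-upload
    file_id = f"file-{key[:8]}"         # stand-in for the id the server would return
    _upload_cache[key] = file_id        # remember it for the rest of the session
    return file_id

edges = b"user,ip\nalice,10.0.0.1\n"
first = upload_file(edges)              # first call uploads
second = upload_file(edges)             # second call is a cache hit
```

Because the nodes and edges tables are hashed separately, editing one table invalidates only that table’s cache entry.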
The payoff is great. Even with fast internet connections, we measured visualizations with 100 edges uploading 2X faster, 100K-row ones 10X faster, and so on. We expect the benefit to be a must-have for users on slower networks or generating many files. If you’re a notebook user, we recommend keeping it on, and we’ll likely switch to default-on in future PyGraphistry versions.
The slowdown on first-time datasets is a low 1% because we use optimized hash computations, including for GPU dataframes. For datasets you know will not be reused in the current session, set memoize=False. Likewise, if you already know the underlying file_id, you can pass it to the underlying APIs and skip the hash check that way as well.
The File REST API brings powerful capabilities for unlocking core use cases, which we will be sharing in upcoming articles. The next release will provide the underlying REST API extensions for it as well; in the meantime, you can inspect the Python client source and documentation.
Multi-GPU with Graphistry, Dask, and BlazingSQL
Going beyond Graphistry’s use of multiple GPUs for handling many concurrent user sessions, Graphistry 2.34 begins using multiple GPUs for handling bigger workloads too. This enables both handling current workloads faster, and handling even bigger graphs — including ones bigger than available GPU memory.
Notebook users, when spinning up Graphistry on a multi-GPU system, can directly use the built-in multi-GPU capabilities via Dask and BlazingSQL:
import cudf, dask_cudf
import graphistry
from dask.distributed import Client

# Automatically GPU-accelerated CSV reader
gdf = cudf.read_csv('events.5gb.csv')

# Automatically multi-GPU-accelerated analytics
with Client('dask-scheduler:8786'):
    dgdf = dask_cudf.from_cudf(gdf, npartitions=20)
    score_sum = dgdf['score'].sum().compute()

# Automatically GPU-accelerated visual analytics sessions
# (g: a Graphistry object with edges already bound, as in the earlier session)
gdf['score_norm'] = gdf['score'] / score_sum
g.nodes(gdf)\
    .encode_point_color('score_norm', ["blue", "yellow", "red"], as_continuous=True)\
    .plot()
This demo shows dask-cudf at work; the same shared cluster and data also serve BlazingSQL on the same big dataset.
Behind the scenes, a few cool things are happening:
- Graphistry runs a managed Dask GPU cluster that notebook users can connect to (containers `dask-scheduler` and `dask-cuda-worker`)
- … and they automatically detect multiple GPUs when available
- The GPU cluster, and its data, can be shared between any RAPIDS libraries, such as dask-cudf and BlazingSQL. They use it both to compute over a dataset more quickly and to chunk up datasets that would otherwise be too big for GPU memory.
- … And Graphistry internally started to accelerate workloads this way as well!
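The partitioned-compute idea behind dask_cudf in the demo above can be sketched on CPU with plain Python. This is an illustrative stand-in, not the RAPIDS API: each chunk plays the role of a per-GPU partition, and no single chunk ever needs to fit in (GPU) memory at once.

```python
# Stand-in for a 'score' column of a large dataset
scores = list(range(1_000_000))

def partition(seq, npartitions):
    """Split seq into roughly equal chunks, like dask_cudf.from_cudf(..., npartitions=N)."""
    size = (len(seq) + npartitions - 1) // npartitions
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Each partition is aggregated independently (in Dask, on its own GPU worker)...
partial_sums = [sum(chunk) for chunk in partition(scores, 20)]

# ...then the scheduler combines the partial results, as .sum().compute() does
score_sum = sum(partial_sums)
```

Dask adds the scheduling, data movement, and fault handling on top of this pattern, which is what lets the same code scale from one GPU to many.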
The File API already supports multi-GPU BlazingSQL at ingest time for uploading bigger-than-memory datasets and preprocessing them to something more intelligible, which we will be sharing in future posts. Graphistry Hub currently accepts uploads up to 200MB, and private servers default to 1GB, which you can increase.
Virtual GPU Support
Graphistry now has explicit Nvidia vGPU support. Running on virtual GPU clusters can ease administration, lower costs, and open up the possibility of using idle GPU clusters (e.g., daytime VDI ones) for compute workloads.
To run Graphistry on a virtual machine with a vGPU, disable the CUDA Unified Memory allocator with a 1-line setting change. All other steps are the same as regular VM vGPU setup and Graphistry installation.
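Concretely, the change is one line in Graphistry’s environment file. The variable name below is an assumption for illustration only; consult the 2.34 install docs for the exact setting your release uses.

```shell
# data/custom.env (hypothetical variable name -- check the official install docs)
# Disable the CUDA Unified Memory allocator, which vGPU profiles do not support
RMM_ALLOCATOR=default
```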
Top speedups and fixes
Beyond the File API mode, our favorites are:
- Speedups: Pageload and dataset filtering both got faster, as did moving nodes
- Fixes: Getting stuck at the ‘75% loading’ screen should occur significantly less, and the custom filter interpreter handles more edge cases