Connecting JS to modern GPU and ML frameworks: Update from Nvidia GTC 2018

Posted by Leo Meyerovich on April 4, 2018

 

The Graphistry team is excited to report: production-grade open GPU compute is coming to JavaScript with the Apache Arrow[JS] project and GOAI. We have been contributing to these projects because they are big enablers for the web. In our case, that means we can build best-of-class visual fastpaths for security and fraud teams struggling to investigate through tools like Splunk, Elastic, and Hadoop.

Read on for our path to making JS compute over GBs and eventually TBs of data in subsecond time. We’d love for even more people to get involved, such as contributing code or, for enterprises, engaging with us!

DGX-2 Announced at GTC 2018. We’re making these scriptable from JavaScript.

BACKGROUND: GPUs, Arrow, and JS

From phones to servers, GPUs are everywhere. The top supercomputers are made from them: Nvidia’s new DGX-2s run at a jaw-dropping 2 petaflops and have 512 GB of GPU RAM. Modern frameworks like Tensorflow and NVGraph already leverage the heck out of these. (… Contact us if you want to experiment with us on NVGraph!)

Unfortunately, modern data tools do not get along with one another, and especially not with JavaScript. A key blocker is their same-but-different data formats. Tensorflow, Spark, and Pandas all store data in ‘shredded, columnar’ layouts because that’s what parallel hardware needs. However, each evolved a slightly different data layout. Having to convert data on-the-fly tanks subsecond performance, assuming there is even a converter available. With the Dremio team, we recently dug into what all this means for the future of data visualization.
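To make ‘shredded, columnar’ concrete, here is a minimal sketch in plain JavaScript with no Arrow dependency. The field names (`id`, `score`) and the helpers `toColumns` and `sumColumn` are made up for illustration. It shows row records being converted into one typed array per field, which is the kind of per-value copying that mismatched formats force on every handoff.

```javascript
// Row-oriented records, as they typically arrive from a JSON API.
const rows = [
  { id: 1, score: 0.5 },
  { id: 2, score: 1.5 },
  { id: 3, score: 4.0 },
];

// Convert rows into a columnar layout: one contiguous typed array per field.
// (Hypothetical helper; every value is copied once, which is the conversion cost.)
function toColumns(records) {
  const id = new Int32Array(records.length);
  const score = new Float64Array(records.length);
  records.forEach((r, i) => {
    id[i] = r.id;
    score[i] = r.score;
  });
  return { id, score };
}

// A column scan reads only the bytes of the one field it needs.
function sumColumn(column) {
  let total = 0;
  for (let i = 0; i < column.length; i++) total += column[i];
  return total;
}

const columns = toColumns(rows);
console.log(sumColumn(columns.score)); // 6
```

Once data is already in columns, scans like `sumColumn` run over contiguous memory; the expensive part is the conversion step, which is what a shared format avoids.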

The Apache Arrow data format solves interop and is already being adopted by Spark, Pandas, and other projects. By agreeing on Arrow, passing data between Arrow-compliant frameworks requires no data conversions. For framework makers, that means writing fewer connectors, and for users, more interop and at faster speeds.

New layers do even more. With Plasma, we don’t even have to copy Arrow data, just pass a pointer. With the GOAI project, those pointers can be to GPU memory. 2018 is nuts: end-to-end GPU computing is becoming the new normal.
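The difference between copying data and passing a pointer can be sketched with standard JavaScript typed arrays. This is only an in-process analogy, not the Plasma or GOAI API: a second view over the same `ArrayBuffer` plays the role of a shared pointer, while `slice()` plays the role of a copying handoff.

```javascript
// One block of memory holding four 32-bit floats.
const buffer = new ArrayBuffer(4 * Float32Array.BYTES_PER_ELEMENT);
const producer = new Float32Array(buffer);
producer.set([1, 2, 3, 4]);

// Zero-copy handoff: a second view over the same bytes
// (byte offset 8 skips the first two floats; length is 2 elements).
const view = new Float32Array(buffer, 8, 2);

// Copying handoff: slice() allocates new memory and duplicates the values.
const copy = producer.slice(2, 4);

// A later write by the producer is visible through the view, not the copy.
producer[2] = 42;
console.log(view[0]); // 42
console.log(copy[0]); // 3
```

Creating the view costs the same regardless of how large the buffer is, while the cost of the copy grows with the data, which is why pointer passing matters at the 100GB scale.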

With Arrow[JS], we’re bringing all this technology to JavaScript developers. All of them. Imagine multi-GPU dataframes for sub-second analytics over billions of rows!

THREE PRINCIPLES

We care a lot about how compute in JavaScript happens:

  • Bridges vs. sandcastles. A trap we want to avoid is hobbled JavaScript rewrites of data infrastructure. We love WebGL2, but Tensorflow can run 100X faster than WebGL2 by leveraging multiple GPUs and using basic optimizations like memory fences. We have a clear path for connecting JS to existing & emerging best-of-class tech, and getting it done this year. Bridges, not sandcastles.
  • Open infrastructure. When Graphistry brings a third-party dependency to our customers, we’re wary of embedding anything closed source and even single-source open core. We were delighted when the GOAI startups joined up with Apache Arrow, and soon after, we donated our first code drop.
  • Rally around standards. We can achieve outsized impact by identifying interop points framework builders can target. When they do, users of every other tool in the ecosystem benefit. For example, as we’re starting to figure out an Arrow-aware ODBC variant, all Arrow-aware BI tools could have out-of-the-box fast data support for any Arrow-compliant database, even without the many vendors coordinating with one another.


Reference architectures for JS GOAI bridges to ML, GPU, and Big Data frameworks

TECHNOLOGY & ROADMAP

We’ve started with node and reference JS implementations. We expect mobile can follow in node’s footsteps, and standard browsers after. We’re progressing through several areas:

  • JS IO: A JS apache-arrow reader & writer, for async batch & streaming interop with Arrow format data, complete with examples for Pandas and MapD. A node-plasma binding for zero-copy sharing of CPU and GPU memory (100GB+). A node-goai buffer library for easily sharing data between the CPU and GPU.
  • Zero-copy nodejs<>python GPU web services. Data is passed via node-plasma. Reference GOAI-capable PyData Docker, and helper library for web requests in a GOAI-aware node web framework (express?) and GOAI-aware python web framework (flask?).
  • JS dataframes and graph compute: Symbolic compute for leveraging dataframe tech like PyGDF and Ray, upcoming graphframe tech like NVGraph, and upcoming Arrow-aware database tech like a more native Turbodbc. Ultimately, something like a js-linq for pushing code, not just data.
  • SQL & Cypher/Tinkerpop: Beyond the direct JS project, there has been pressure to make database wire protocols work out-of-the-box with Arrow, especially given modern systems are increasingly columnar. As a BI-for-investigations company, we are connecting teams here to develop ideas like an Arrow-aware ODBC that gives everyone a single target.
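As a rough illustration of "pushing code, not just data" from the dataframe item above, the sketch below records a query as a serializable plan instead of executing it in JavaScript. Everything here is hypothetical: the `Query` class, the plan shape, and `runPlan` are invented for this example and are not the API of any existing library. The small local interpreter only stands in for a remote engine.

```javascript
// Hypothetical query builder: each call records a step instead of touching data.
class Query {
  constructor(table, steps = []) {
    this.table = table;
    this.steps = steps;
  }
  filter(column, op, value) {
    return new Query(this.table, [...this.steps, { kind: "filter", column, op, value }]);
  }
  select(...columns) {
    return new Query(this.table, [...this.steps, { kind: "select", columns }]);
  }
  // The serialized plan is what would be shipped to the engine that holds the data.
  toPlan() {
    return JSON.stringify({ table: this.table, steps: this.steps });
  }
}

// Tiny local interpreter standing in for a remote engine (illustration only).
const OPS = { ">": (a, b) => a > b, "==": (a, b) => a === b };
function runPlan(planJson, tables) {
  const plan = JSON.parse(planJson);
  let rows = tables[plan.table];
  for (const step of plan.steps) {
    if (step.kind === "filter") {
      rows = rows.filter((r) => OPS[step.op](r[step.column], step.value));
    } else if (step.kind === "select") {
      rows = rows.map((r) => Object.fromEntries(step.columns.map((c) => [c, r[c]])));
    }
  }
  return rows;
}

const plan = new Query("events").filter("bytes", ">", 100).select("host").toPlan();
const result = runPlan(plan, {
  events: [
    { host: "a", bytes: 50 },
    { host: "b", bytes: 500 },
  ],
});
console.log(result); // [ { host: 'b' } ]
```

The point of the design is that only the small plan string crosses the wire; the rows stay wherever the engine keeps them, and only the filtered result comes back.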

GET INVOLVED

For open source coders, there’s a lot possible, and we welcome new and out-of-order efforts. Anything in the above roadmap is fair game! The GitHub project is a great place to get started, as is emailing someone on our team.

For industry partners, we are happy to engage in experimental projects for use of these technologies. The Graphistry team already regularly engages with the US government and the F500 for tackling problems in security, fraud, and fintech in how they use tools like Splunk, Elastic, and Hadoop. We are especially excited by connecting JS to PyData/PyGDF, and given the nature of our visual investigation product, building up tooling and usability of NVGraph. We’re often in SF, Austin, DC, and NYC and are happy to catch up – feel free to reach out!
