Create a Spark DataFrame from Pandas or NumPy with Arrow

If you are a Pandas or NumPy user and have ever tried to create a Spark DataFrame from local data, you might have noticed that it is an unbearably slow process. In fact, the time it takes usually rules it out for any data set that is at all interesting. Starting with Spark 2.3, SPARK-22216 enables creating a DataFrame from Pandas using Arrow, which makes this process much more efficient. You can now transfer large data sets from your local Pandas session to Spark almost instantly, and be sure that your data types are preserved. This post will demonstrate a simple example of how to do this and walk through the Spark internals of how it is accomplished.
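As a quick taste before the full walkthrough, here is a minimal sketch of the idea, assuming Spark 2.3 with pyarrow installed. The config key shown is the Spark 2.3 name; later releases renamed it to spark.sql.execution.arrow.pyspark.enabled.

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based transfer (Spark 2.3 config name)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# A local Pandas DataFrame built from NumPy data
pdf = pd.DataFrame(np.random.rand(1000000, 3), columns=["a", "b", "c"])

# With Arrow enabled, this is a fast columnar transfer instead of
# pickling one row at a time through the JVM gateway
df = spark.createDataFrame(pdf)
df.printSchema()
```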

Read More

Model Parallelism with Spark ML Tuning

Tuning machine learning models in Spark involves selecting the best-performing parameters for a model using CrossValidator or TrainValidationSplit. This process uses a parameter grid, where a model is trained for each combination of parameters and evaluated according to a metric. Prior to Spark 2.3, CrossValidator and TrainValidationSplit would train and evaluate one model at a time, in serial, until every combination in the parameter grid had been evaluated. Spark of course performs data parallelism throughout this process as usual, but depending on your cluster configuration this can leave resources severely under-utilized. As the list of hyperparameters grows, the time to complete a run can grow exponentially, so it is crucial that all available resources are put to use. Model parallelism allows Spark to train and evaluate models in parallel, which can keep resources utilized and lead to dramatic speedups. Beginning with Spark 2.3 and SPARK-19357, this feature is available but runs in serial by default. This post will show you how to enable it, run through a simple example, and discuss best practices.
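To give a sense of what enabling it looks like, here is a minimal sketch using the parallelism param added in Spark 2.3; the value 4 and the train DataFrame are placeholders for your own setup.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()

# A 3x3 grid means 9 models trained and evaluated per fold
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    parallelism=4)  # up to 4 models at once; default is 1 (serial)

# cv_model = cv.fit(train)  # 'train' is your labeled training DataFrame
```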

Read More

Vectorized UDFs in PySpark

The introduction of Apache Arrow in Spark makes it possible to evaluate Python UDFs as vectorized functions. In addition to the performance benefits of vectorization, it opens up more possibilities by using Pandas for the input and output of the UDF. This post will show some details of ongoing work I have been doing in this area and how to put it to use.
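For a flavor of the API as it landed in Spark 2.3, here is a minimal sketch of a scalar vectorized UDF; the details are in the post.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()

# A scalar Pandas UDF: receives a pandas.Series for each batch of rows
# and must return a Series of the same length
@pandas_udf("double")
def plus_one(v):
    return v + 1.0

df = spark.range(1000).select(plus_one(col("id").cast("double")).alias("x"))
df.show(3)
```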

Read More

Spark Cross-Validation with Multiple Pipelines

Cross-validation with Apache Spark pipelines is commonly used to tune the hyperparameters of stages in a PipelineModel. But what do you do if you want to evaluate more than one pipeline with different stages, e.g. using different types of classifiers? You would probably just run cross-validation on each pipeline separately and compare the results, which would generally work fine. What you might not know is that the stages are actually a parameter of the Pipeline itself and can be evaluated just like any other parameter, with a few caveats. So it is possible to put multiple pipelines into the Spark CrossValidator to automatically select the best one and make more efficient use of your data and caching. This post will show you how.
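As a preview, here is a minimal sketch of the idea; the feature columns f1 and f2 and the train DataFrame are hypothetical placeholders.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")

# 'stages' is itself a Param of Pipeline, so the grid can swap in
# entirely different pipelines to compare
pipeline = Pipeline()
grid = (ParamGridBuilder()
        .addGrid(pipeline.stages, [[assembler, LogisticRegression()],
                                   [assembler, RandomForestClassifier()]])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator())

# best = cv.fit(train).bestModel  # 'train' is your labeled training DataFrame
```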

Read More

Spark toPandas() with Arrow, a Detailed Look

The upcoming release of Apache Spark 2.3 will include Apache Arrow as a dependency. For those who don't know, Arrow is an in-memory columnar data format with APIs in Java, C++, and Python. Since Spark does a lot of data transfer between the JVM and Python, this is particularly useful and can really help optimize the performance of PySpark. In my post on the Arrow blog, I showed a basic example of how to enable Arrow for a much more efficient conversion of a Spark DataFrame to Pandas. Following that, this post will take a more detailed look at how this is done internally in Spark, why it leads to such a dramatic speedup, and what else can be improved upon in the future.
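To see the difference for yourself, here is a minimal sketch you can time both ways; the config key is the Spark 2.3 name (later renamed to spark.sql.execution.arrow.pyspark.enabled).

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1 << 22).selectExpr("id", "rand() as x")

spark.conf.set("spark.sql.execution.arrow.enabled", "false")
start = time.time()
pdf = df.toPandas()  # rows pickled and sent to Python one at a time
print("without Arrow: %.2fs" % (time.time() - start))

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
start = time.time()
pdf = df.toPandas()  # transferred as columnar Arrow record batches
print("with Arrow:    %.2fs" % (time.time() - start))
```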

Read More

Never thought I'd blog, but here we go...

It's probably the engineer in me, but I'd much rather be programming than writing a dumb blog. I generally don't care for them and never had any interest in writing one. They're mostly full of noise and fluff, or rehashed ideas. But these days I've been working a lot in open source, and I've seen some great posts. I guess sometimes a blog is the best way to spread useful information or build up some interest around a good idea. Hopefully this will accomplish that and give back to some of the great online communities I have been working with - or maybe all of it is just noise and fluff, I don't really know.

Read More