Create a Spark DataFrame from Pandas or NumPy with Arrow
If you are a Pandas or NumPy user and have ever tried to create a Spark DataFrame from local data, you might have noticed that it is an unbearably slow process. In fact, the time it takes to do so usually prohibits this from any data set that is at all interesting. Starting from Spark 2.3, the addition of SPARK-22216 enables creating a DataFrame from Pandas using Arrow to make this process much more efficient. You can now transfer large data sets to Spark from your local Pandas session almost instantly and also be sure that your data types are preserved. This post will demonstrate a simple example of how to do this and walk through the Spark internals of how it is accomplished.