Maximize Your Python Code Performance with Vaex: A Game Changer
Chapter 1: Introduction to Vaex
Pandas serves as a fundamental library for data manipulation in Python, offering numerous functions for data cleaning, transformation, and analysis. However, it often struggles with large datasets that exceed memory capacity, causing slowdowns and even system crashes. If you've ever faced your computer freezing while processing hefty datasets, you're likely familiar with this dilemma.
Fortunately, there's a solution: Vaex.
What is Vaex?
Vaex is a robust library tailored for managing large datasets that cannot fit into memory. It employs a lazy and out-of-core computing approach, meaning it only loads data into memory when necessary, allowing for operations on substantial datasets that would be unmanageable with Pandas alone.
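To make the out-of-core idea concrete, here is a minimal sketch that uses only the Python standard library. It is not how Vaex is implemented; it only illustrates the general principle of memory-mapping a file and processing it one chunk at a time, so the full dataset never has to sit in memory. The file name, the helper names write_doubles and chunked_sum, and the chunk size are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("<d")  # size in bytes of one little-endian float64


def write_doubles(path, n):
    """Write the values 0.0 .. n-1 to a binary file as float64."""
    with open(path, "wb") as f:
        for start in range(0, n, 10_000):
            chunk = [float(i) for i in range(start, min(start + 10_000, n))]
            f.write(struct.pack(f"<{len(chunk)}d", *chunk))


def chunked_sum(path, chunk_items=4096):
    """Sum every float64 in the file while decoding one chunk at a time."""
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n_items = len(mm) // ITEM
            for start in range(0, n_items, chunk_items):
                stop = min(start + chunk_items, n_items)
                # Slicing the map copies only this chunk's bytes into memory.
                raw = mm[start * ITEM:stop * ITEM]
                total += sum(struct.unpack(f"<{stop - start}d", raw))
    return total


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")  # hypothetical file name
    write_doubles(path, 100_000)
    print(chunked_sum(path))
```

Only one chunk is decoded at any moment, so peak memory use depends on the chunk size rather than on the size of the file.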
In this article, we will delve into Vaex's capabilities, comparing it with Pandas to illustrate how it can efficiently handle extensive datasets. Get ready to bid farewell to sluggish data processing and embrace the speed of Vaex!
Chapter 2: Getting Started with Vaex
To begin utilizing Vaex, install it using the command below. If you're operating within a Jupyter Notebook, prefix the command with an exclamation mark.
pip install vaex
Once installed, you can import Vaex alongside Pandas and start working with your dataset. For our demonstration, we will utilize the Health Insurance Marketplace dataset, which is approximately 2GB in size.
import pandas as pd
import vaex
df = pd.read_csv("Rate.csv")
df2 = pd.concat([df, df, df, df, df, df]) # To increase the size/rows of the dataframe
df_vaex = vaex.from_pandas(df2)
print(df2.shape) # (76166670, 24)
Our dataset comprises around 76 million rows and 24 columns.
Chapter 3: Performance Comparisons
Now that our dataset is ready, let’s evaluate the execution time of common operations using the timeit library.
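The comparisons in this chapter time each operation once (number=1), which keeps the benchmark short but makes it sensitive to noise such as caching or background processes. A hedged alternative, sketched below with stand-in workloads rather than the real dataset, is to repeat each measurement and keep the fastest run. The helper names best_time and speedup are hypothetical, and the two sum functions exist only to give the helpers something to measure.

```python
import timeit


def best_time(func, repeat=5):
    """Call func once per run, `repeat` times, and return the fastest time in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workloads (not the article's dataset).
data = list(range(200_000))


def slow_sum():
    total = 0
    for x in data:
        total += x
    return total


def fast_sum():
    return sum(data)


slow = best_time(slow_sum)
fast = best_time(fast_sum)
print(f"loop: {slow:.4f}s, builtin: {fast:.4f}s, speedup: {speedup(slow, fast):.1f}x")
```

On a 76-million-row DataFrame, repeating the Pandas side several times can take a while, so a smaller repeat value may be a reasonable compromise.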
Section 3.1: Aggregation Speed Test
Aggregating data with Vaex proves to be significantly faster compared to Pandas. Below, we group the data by the "IssuerId" column and count the number of entries in each group.
import timeit
# Using Pandas
pandas_time = timeit.timeit(lambda: df2.groupby("IssuerId").count(), number=1)
# Using Vaex
vaex_time = timeit.timeit(lambda: df_vaex.groupby("IssuerId").agg({'IssuerId':'count'}), number=1)
print("Pandas time:", pandas_time)
print("Vaex time:", vaex_time)
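In this run, Vaex performed the aggregation approximately 236 times faster than Pandas. Note that the two calls are not identical: the Pandas call counts non-null values in every column, while the Vaex call counts only "IssuerId", so part of the gap comes from the extra work Pandas is asked to do.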
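For readers who want to see what a group-by count computes, and why it can be done without holding the whole table in memory, here is a small standard-library sketch. It is not how Pandas or Vaex implement the operation. The inline CSV and its values are invented for illustration; only the "IssuerId" and "IndividualRate" column names come from the dataset used above, and count_by_key is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_by_key(rows, key):
    """Count rows per distinct value of `key`, reading one row at a time."""
    counts = Counter()
    for row in rows:
        counts[row[key]] += 1
    return counts


# Invented sample data standing in for a large CSV file on disk.
sample_csv = io.StringIO(
    "IssuerId,IndividualRate\n"
    "101,29.5\n"
    "202,31.0\n"
    "101,45.2\n"
    "303,12.9\n"
    "101,33.3\n"
)

counts = count_by_key(csv.DictReader(sample_csv), "IssuerId")
print(counts.most_common())
```

Because only the running counts are kept, memory use grows with the number of distinct groups rather than with the number of rows.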
Section 3.2: Filtering and Sorting Efficiency
We can also examine the speed of filtering and sorting operations.
# Using Pandas
pandas_time = timeit.timeit(lambda: df2[df2['IndividualRate'] > 30].sort_values(by='IndividualRate'), number=1)
# Using Vaex
vaex_time = timeit.timeit(lambda: df_vaex[df_vaex.IndividualRate > 30].sort('IndividualRate'), number=1)
print("Pandas time:", pandas_time)
print("Vaex time:", vaex_time)
The analysis reveals that Vaex executes filtering and sorting operations around 9 times faster than Pandas.
Section 3.3: Handling Null Values
Addressing null values is a standard procedure in data management. Let's compare the time taken to replace NaN values in the 'IndividualTobaccoRate' column with 'Unknown'.
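# Using Pandas (df2, so both libraries work on the same 76 million rows)
pandas_time = timeit.timeit(lambda: df2.fillna({'IndividualTobaccoRate': 'Unknown'}), number=1)
# Using Vaex (fillna takes the fill value plus an optional list of column names)
vaex_time = timeit.timeit(lambda: df_vaex.fillna(value='Unknown', column_names=['IndividualTobaccoRate']), number=1)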
print("Pandas time:", pandas_time)
print("Vaex time:", vaex_time)
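In this run, Vaex completed the task approximately 450 times faster than Pandas. Much of that gap comes from laziness: Vaex's fillna only defines a virtual column and defers the replacement until the values are actually needed, whereas Pandas builds the filled DataFrame immediately.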
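The lazy approach mentioned in the introduction is worth keeping in mind when reading timings like this one. The toy sketch below shows the difference between recording an operation and executing it. LazyColumn is a hypothetical class written for this illustration, not part of Vaex's API, and it makes no claim about Vaex's internals.

```python
import math


class LazyColumn:
    """A toy column that stores a recipe of steps instead of computed values."""

    def __init__(self, source, transforms=()):
        self._source = source
        self._transforms = tuple(transforms)

    def fillna(self, value):
        """Record a NaN-replacement step; no element is touched here."""

        def replace_nan(x):
            return value if isinstance(x, float) and math.isnan(x) else x

        return LazyColumn(self._source, self._transforms + (replace_nan,))

    def evaluate(self):
        """Apply every recorded step to every element and return a new list."""
        out = []
        for x in self._source:
            for transform in self._transforms:
                x = transform(x)
            out.append(x)
        return out


raw = [1.5, float("nan"), 3.0, float("nan")]
lazy = LazyColumn(raw).fillna("Unknown")  # returns immediately
print(lazy.evaluate())  # the real work happens only here
```

Timing only the fillna call on such an object measures the bookkeeping, not the replacement itself, so a fair comparison with an eager library would also include the cost of evaluating the result.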
Section 3.4: Value Counting Performance
Next, we evaluate the execution time for counting how often each unique value occurs in a column.
pandas_time = timeit.timeit(lambda: df2['SourceName'].value_counts(), number=1)
vaex_time = timeit.timeit(lambda: df_vaex['SourceName'].value_counts(), number=1)
print("Pandas time:", pandas_time)
print("Vaex time:", vaex_time)
Vaex is approximately 1.7 times faster than Pandas in this operation.
Section 3.5: Histogram Creation Speed
Lastly, let's analyze the time taken to create a histogram.
import matplotlib.pyplot as plt
pandas_time = timeit.timeit(lambda: plt.hist(df2.IssuerId), number=1)
vaex_time = timeit.timeit(lambda: df_vaex.viz.histogram(df_vaex.IssuerId), number=1)
print("Pandas time:", pandas_time)
print("Vaex time:", vaex_time)
While both approaches perform well, Vaex generates the histogram about 3 times faster than calling Matplotlib's plt.hist on the Pandas column.
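A histogram needs only one pass over the data and a small array of bin counts, which is part of why it suits chunked, out-of-core processing. The pure-Python sketch below illustrates that idea. The function name bin_counts, the bin range, and the sample values are assumptions for illustration, and the code is not taken from Vaex or Matplotlib.

```python
def bin_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        if v < low or v > high:
            continue  # ignore values outside the requested range
        # Values equal to `high` are placed in the last bin.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


print(bin_counts([1, 2, 2, 3, 9, 10], low=0, high=10, n_bins=5))
# prints [1, 3, 0, 0, 2]
```

Counts from separate chunks can simply be added element by element, so the full column never needs to be in memory at once.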
Chapter 4: The Role of Vaex in Data Processing
So, can Vaex fully replace Pandas? The answer is no. Vaex is not a substitute for Pandas but rather a powerful companion for handling large datasets that would be challenging for Pandas. While Vaex excels in speed on substantial datasets, Pandas remains the preferred tool for smaller datasets and for operations that cannot be easily parallelized.
For a deeper understanding of how Vaex can be utilized for massive datasets, check out this insightful blog.
Thank You for Reading!
I hope you found this information valuable. Feel free to share your thoughts in the comments section.