Working with large datasets in Python

How can you work with large datasets in Python — millions, billions, or more records?

Here is a great answer to that question:

  1. Pandas is memory hungry, you may need 8-16GB of memory or more to load your dataset into memory to work efficiently.

You can use large / extra large AWS cloud systems for temporary access. Being able to spin up cloud platforms on demand is only one part of the equation. You also need to get your data in and out of the cloud platform. So persistent storage and on-demand compute is a likely strategy.

  1. Work in stages, and save each stage.

You will also want to be able to save your intermediate states. If you process CSV or JSON — or perform filter, map, reduce, etc functions you’ll want those to be atomic processes.

And you’ll want to persist work as you go. If you process 100 million rows of data and something happens on row 99 million, you don’t want to have to re-do the whole process to get a clean data transformation. Especially if it takes several minutes or hours.

Better to save each stage iteratively and incur the IO cost in your ETL or processing loop than to lose your work — or have your data corrupted.

  1. Take adavantage of multiprocessing

Break work into batches that can be performed in parallel. If you have multiple CPUs, take advantage of them.

Python doesn’t do this by default, and Pandas normally works on a single thread. Dask or Vaex can work in parallel where Pandas itself cannot.
You might also consider using a Streaming processor such as Apache Spark instead of doing all your processing in a single DataFrame.

  1. Use efficient functions and data objects

Earlier I talked about saving incrementally within your processing loop. But don’t go overboard.

You do not want to open and close files every iteration in your inner loop millions of times. Make sure you fetch the data you need only once. Don’t reinitialize objects. Even someting as simple as calling len(data) can really add up. Finding out the length of an expanding list with millions of rows millions of times really adds up.

Also, consider when you want to use a list vs a numpy array, etc.