Amit Indap Bioinformatics & Data Science Blog

Musings on Bioinformatics, Data Science, Python, R, and more.

View My GitHub Profile

17 February 2025

Polars Lazyframes

by Amit Indap

I’ve been taking a closer look at Polars, a Rust based dataframe library with a Python frontend. Polars is similar to Pandas, but is faster and more memory efficient. Polars is also designed to work with large datasets that don’t fit into memory.

One of the features I’ve been exploring is the ability to work with lazyframes. Lazyframes are a way to delay computation until it is absolutely necessary. This can be useful when working with large datasets, as it can save time and memory. Take a look at the lazy API concept for more information as well as the the lazy user guide.

As a real world example, I collect all the solar panel production data for my residence in a collection of CSV files. The code below using the lazy evaluation to read in the data and perform and calculate average monthly production.


import polars as pl

q = (
    pl.scan_csv("MonthlyData/*.csv", try_parse_dates=True)
    .select(
        # Convert the Time column to a proper date
        pl.col("Time")
          .str.strptime(pl.Date, format="%m/%d/%Y")
          .alias("Date"),
        # Divide by 1000 and alias as kWh for clarity
        (pl.col("System Production (Wh)") / 1000)
          .alias("System Production (kWh)")
    )
    .group_by(pl.col("Date").dt.month().alias("month"))
    # Now compute the average
    .agg(
        pl.col("System Production (kWh)").mean().round(2).alias("avg_production_kWh")
    )
    # round avg_production_kWh to 2 decimal places
    
    .sort("month")
)

# Finally, collect the results if you want them in-memory
df = q.collect()




print(df)

The key point is the query is only executed once the collect() method is called.

The query syntax reminds me a lot R tidyverse syntax, which I always found very intuitive. I do more work in Python these days, but Pandas syntax I find a bit cumbersome. At first glance, Polars syntax seems more intuitive and easier to read.

I’m hoping to combine this code with an Amazon Lambda function to automate the data collection and analysis for my solar panel production.

tags: Python