Datashader for Large Datasets

Purpose

When you have hundreds of thousands or millions of data points, traditional scatter plots become too slow and visually cluttered. Datashader rasterizes and aggregates points into a grid for efficient, artifact-free rendering.

Use Cases

Stock tick data (millions of prices per second).
Sensor readings from IoT devices.
Geographic data points (billions of GPS coordinates).
Scientific simulations with dense output.

Example in the Notebook

The notebook demonstrates datashader by:

Creating a synthetic dataset (tips repeated 2000 times with jitter) — ~500k rows.
Aggregating points into a canvas grid.
Shading the grid by point count using a colormap.
Falling back to Matplotlib hexbin if datashader is unavailable.

Key Code Snippet

import datashader as ds
import datashader.transfer_functions as tf

cvs = ds.Canvas(plot_width=800, plot_height=400)
agg = cvs.points(big, 'x', 'y', ds.count())
img = tf.shade(agg, cmap=colorcet.m_fire, how='eq_hist')

Customization Tips

Canvas size: Adjust plot_width and plot_height for resolution.
Aggregation: Use ds.count() (default), ds.mean(), ds.sum(), or custom reductions.
Colormaps: Choose from colorcet library (perceptually uniform): m_fire, m_viridis, etc.
Normalization: Use how='eq_hist' for histogram equalization or how='linear' for simple scaling.
Fallback: The notebook includes a sns.hexbin() fallback if datashader is not installed.

Installation

If datashader is not in your environment:

pip install datashader colorcet

When to Use

Always: for datasets with >100k points.
Consider: for exploratory analysis of massive datasets.
Avoid: if you need individual point interactivity (use aggregated tooltips instead).

Data Visualization Techniques Demo