---
jupytext:
  notebook_metadata_filter: myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
myst:
  html_meta:
    description lang=en: Documentation for the %sqlplot magic from JupySQL
    keywords: jupyter, sql, jupysql, plotting
    property=og:locale: en_US
---

# `%sqlplot`

```{versionadded} 0.5.2
```


```{note}
`%sqlplot` requires `matplotlib`: `pip install matplotlib` and this example requires
duckdb-engine: `pip install duckdb-engine`
```

```{code-cell} ipython3
%load_ext sql
```

```{code-cell} ipython3
%sql duckdb://
```

```{code-cell} ipython3
from pathlib import Path
from urllib.request import urlretrieve

if not Path("penguins.csv").is_file():
    urlretrieve(
        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
        "penguins.csv",
    )
```

```{code-cell} ipython3
%%sql
SELECT * FROM "penguins.csv" LIMIT 3
```

```{note}
You can view the documentation and command line arguments by running `%sqlplot?`
```

## `%sqlplot boxplot`


```{note}
To use `%sqlplot boxplot`, your SQL engine must support:

`percentile_disc(...) WITHIN GROUP (ORDER BY ...)`

[Snowflake](https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html),
[Postgres](https://www.postgresql.org/docs/9.4/functions-aggregate.html),
[DuckDB](https://duckdb.org/docs/sql/aggregates), and others support this.
```

Shortcut: `%sqlplot box`

`-t`/`--table` Table to use (if using DuckDB: path to the file to query)

`-s`/`--schema` Schema to use (No need to pass if using a default schema)

`-c`/`--column` Column(s) to plot. You might pass one than one value (e.g., `-c a b c`)

`-o`/`--orient` Boxplot orientation (`h` for horizontal, `v` for vertical)

`-w`/`--with` Use a previously saved query as input data

```{code-cell} ipython3
%sqlplot boxplot --table penguins.csv --column body_mass_g
```

### Transform data before plotting

```{code-cell} ipython3
%%sql
SELECT island, COUNT(*)
FROM penguins.csv
GROUP BY island
ORDER BY COUNT(*) DESC
```

```{code-cell} ipython3
%%sql --save biscoe --no-execute
SELECT *
FROM penguins.csv
WHERE island = 'Biscoe'
```

Since we saved `biscoe` from the cell above, we can pass it as an argument to `--table` since jupysql autogenerates the CTE.

```{code-cell} ipython3
%sqlplot boxplot --table biscoe --column body_mass_g
```

### Horizontal boxplot

```{code-cell} ipython3
%sqlplot boxplot --table penguins.csv --column bill_length_mm --orient h
```

### Multiple columns

```{code-cell} ipython3
%sqlplot boxplot --table penguins.csv --column bill_length_mm bill_depth_mm flipper_length_mm
```

## `%sqlplot histogram`

Shortcut: `%sqlplot hist`

`-t`/`--table` Table to use (if using DuckDB: path to the file to query)

`-s`/`--schema` Schema to use (No need to pass if using a default schema)

`-c`/`--column` Column to plot

`-b`/`--bins` (default: `50`) Number of bins

`-B`/`--breaks` Custom bin intervals

`-W`/`--binwidth` Width of each bin

`-w`/`--with` Use a previously saved query as input data

```{note}
When using -b/--bins, -B/--breaks, or -W/--binwidth, you can only specify one of them. If none of them is specified, the default value for -b/--bins will be used.
```

+++

Histogram supports NULL values by skipping them. Now we can
generate histograms without explicitly removing NULL entries.
```{versionadded} 0.7.9
```

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g 
```

When plotting a histogram, it divides a range with the number of bins - 1 to calculate a bin size. Then, it applies round half down relative to the bin size and categorizes continuous values into bins to replicate right closed intervals from the ggplot histogram in R.

![body_mass_g](../static/body_mass_g_R.png)

+++

### Specifying bins

Bins allow you to set the number of bins in a histogram, and it's useful when you are interested in the overall distribution.

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g  --bins 100
```

### Specifying breaks

Breaks allow you to set custom intervals for a histogram. It is useful when you want to view distribution within a specific range. You can specify breaks by passing desired each end and break points separated by whitespace after `-B/--breaks`. Since those break points define a range of data points to plot, bar width, and number of bars in a histogram, make sure to pass more than 1 point that is strictly increasing and includes at least one data point.

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g --breaks 3200 3400 3600 3800 4000 4200 4400 4600 4800
```

### Specifying binwidth

Binwidth allows you to set the width of bins in a histogram. It is useful when you directly aim to adjust the granularity of the histogram. To specify the binwidth, pass a desired width after `-W/--binwidth`. Since the binwidth determines details of distribution, make sure to pass a suitable positive numeric value based on your data.

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g --binwidth 150
```

### Multiple columns

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column bill_length_mm bill_depth_mm 
```

## Customize plot

`%sqlplot` returns a `matplotlib.Axes` object.

```{code-cell} ipython3
ax = %sqlplot histogram --table penguins.csv --column body_mass_g
ax.set_title("Body mass (grams)")
_ = ax.grid()
```

## `%sqlplot bar`

```{versionadded} 0.7.6
```

Shortcut: `%sqlplot bar`

`-t`/`--table` Table to use (if using DuckDB: path to the file to query)

`-s`/`--schema` Schema to use (No need to pass if using a default schema)

`-c`/`--column` Column to plot.

`-o`/`--orient` Barplot orientation (`h` for horizontal, `v` for vertical)

`-w`/`--with` Use a previously saved query as input data

`-S`/`--show-numbers` Show numbers on top of the bar

Bar plot does not support NULL values, so we automatically remove them, when plotting.

```{code-cell} ipython3
%sqlplot bar --table penguins.csv --column species 
```

You can additionally pass two columns to bar plot i.e. `x` and `height` columns.

```{code-cell} ipython3
%%sql --save add_col --no-execute
SELECT species, count(species) as cnt
FROM penguins.csv
group by species
```

```{code-cell} ipython3
%sqlplot bar --table add_col --column species cnt
```

You can also pass the orientation using the `orient` argument.

```{code-cell} ipython3
%sqlplot bar --table add_col --column species cnt --orient h
```

You can also show the number on top of the bar using the `S`/`show-numbers` argument.

```{code-cell} ipython3
%sqlplot bar --table penguins.csv --column species -S
```

## `%sqlplot pie`

```{versionadded} 0.7.6
```

Shortcut: `%sqlplot pie`

`-t`/`--table` Table to use (if using DuckDB: path to the file to query)

`-s`/`--schema` Schema to use (No need to pass if using a default schema)

`-c`/`--column` Column to plot

`-w`/`--with` Use a previously saved query as input data

`-S`/`--show-numbers` Show the percentage on top of the pie

Pie chart does not support NULL values, so we automatically remove them, when plotting the pie chart.

```{code-cell} ipython3
%sqlplot pie --table penguins.csv --column species
```

You can additionally pass two columns to bar plot i.e. `labels` and `x` columns.

```{code-cell} ipython3
%%sql --save add_col --no-execute
SELECT species, count(species) as cnt
FROM penguins.csv
group by species
```

```{code-cell} ipython3
%sqlplot pie --table add_col --column species cnt
```

Here, `species` is the `labels` column and `cnt` is the `x` column.


You can also show the percentage on top of the pie using the `S`/`show-numbers` argument.

```{code-cell} ipython3
%sqlplot pie --table penguins.csv --column species -S
```

## Parametrizing arguments

JupySQL supports variable expansion of arguments in the form of `{{variable}}`. This allows the user to specify arguments with placeholders that can be replaced by variables dynamically.

```{code-cell} ipython3
%%sql 
DROP TABLE IF EXISTS penguins;
CREATE SCHEMA IF NOT EXISTS s1;
CREATE TABLE s1.penguins (
    species VARCHAR(255),
    island VARCHAR(255),
    bill_length_mm DECIMAL(5, 2),
    bill_depth_mm DECIMAL(5, 2),
    flipper_length_mm DECIMAL(5, 2),
    body_mass_g INTEGER,
    sex VARCHAR(255)
);
COPY s1.penguins FROM 'penguins.csv' WITH (FORMAT CSV, HEADER TRUE);
```

```{code-cell} ipython3
table = "penguins"
schema = "s1"
orient = "h"
column = "bill_length_mm"
```

```{code-cell} ipython3
%sqlplot boxplot --table {{table}} --schema {{schema}} --column {{column}} --orient {{orient}}
```

Now let's see another example using `--with`:

```{code-cell} ipython3
snippet = "gentoo"
```

```{code-cell} ipython3
%%sql --save {{snippet}}
SELECT * FROM {{schema}}.{{table}} 
WHERE species == 'Gentoo'
```

```{code-cell} ipython3
%sqlplot boxplot --table {{snippet}} --with {{snippet}} --column {{column}}
```
