
Getting hy with pandas

TL;DR

In this post, we scratch the surface of Hy, a Lisp dialect for Python, by converting a pandas pipeline. The post assumes some familiarity with pandas.

Introduction

I find the Hy project really interesting. From the website:

Hy is a wonderful dialect of Lisp that’s embedded in Python. Since Hy transforms its Lisp code into the Python Abstract Syntax Tree, you have the whole beautiful world of Python at your fingertips, in Lisp form!

If you rely on some Python libraries but wish you could write code closer to Clojure, why not try it out?

In this post, we will convert a small pandas pipeline from Python to Hy.

Lisps

From Lisp’s Wikipedia page:

Lisp is an expression oriented language. Unlike most other languages, no distinction is made between “expressions” and “statements”; all code and data are written as expressions. When an expression is evaluated, it produces a value (in Common Lisp, possibly multiple values), which can then be embedded into other expressions. Each value can be any data type.

If you have never seen a Lisp before, start with Hy’s intro. The first pages of “Clojure for the Brave and True” also provide an entertaining introduction to Lisp syntax.

Hy syntax

To discover Hy, you can write small Hy scripts and run the hy2py <hy-file> command on them without flags. This will show you the resulting Python code. The Hy REPL is also helpful for messing around.

While trying to write my first Hy script, I had a few struggles.

Badly closed parentheses

I guess it’s a common problem with Lisp. I often made mistakes and got messages like the ones below.

For too many parentheses:

LexException: Ran into a RPAREN where it wasn't expected.

For too few parentheses:

LexException: Premature end of input

Double quote your strings

In Hy, strings are written with double quotes. Single quotes are used for something completely different. I fell into that trap again and again, raising exceptions like:

LexException: Could not identify the next token.

Accessing methods vs attributes

A basic function call in Hy goes like this (note that arguments are only separated by spaces):

(<function-name> <arg1> <arg2> ...)

If you need to use named arguments, you prefix their names with a :.

(<function-name> <arg1> :name-for-arg2 <arg2> ...)

What about methods? How do you translate something like "my-string-to-split".split("-") to Hy? To call the <str>.split(<del>) method on a string, use .split as the function name, pass the string you want to split as the first argument, and then add the arguments for .split (here, the delimiter).

(.split "my-string-to-split" "-")
=> ["my", "string", "to", "split"]

If the target of your method is stored in a variable, you can actually use a more familiar syntax.

(setv mystr "my string to split")
(mystr.split  " ")
=> ["my", "string", "to", "split"]

This is less flexible (the target needs to be a variable) but can be convenient for accessing methods on imported packages.

(import os)

(.listdir os)
=> ['file1', 'file2',...]

;; or alternatively
(os.listdir)
=> ['file1', 'file2',...]

Note that we do not use (.listdir() os) or (os.listdir()). The outer pair of parentheses wrapping the expression is already what calls the .listdir function.
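For readers coming from Python, the (.split "my-string-to-split" "-") form has a close cousin there: a method can be called through its class, with the instance passed explicitly as the first argument. A minimal sketch of that parallel (plain Python, not a description of how Hy implements the call):

```python
text = "my-string-to-split"

# Usual method call: the object comes first, then a dot, then the method.
via_method = text.split("-")

# Same call through the class: the object is now the first argument,
# which is the shape Hy uses in (.split "my-string-to-split" "-").
via_class = str.split(text, "-")

print(via_method)
print(via_method == via_class)
```

Both calls return the same list; only the position of the object changes.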

What about attributes? What if we do not want to call a function? Just omit the parentheses around the attribute access. For example, a successful GET request to a webpage made with the requests package returns a response with a text attribute containing the HTML of the page.

(import requests)

(requests.get "https://www.duckduckgo.com").text
=> ERROR - 'cannot access attribute on anything other than a name'

Right… We saw before that the plain dot notation only works on variables. Let’s try with a variable:

(import requests)

(setv r (requests.get "https://www.duckduckgo.com"))
r.text
=> '<!DOCTYPE html>\n<!--[if IEMobi...

If we don’t want to set a variable and keep a “functional” style (which has many advantages), we can use the . function, which is used to perform attribute access on objects.

(import requests)

(. (requests.get "https://www.duckduckgo.com") text)
=> '<!DOCTYPE html>\n<!--[if IEMobi...
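If it helps to think of attribute access as a function call, Python has the same idea in getattr and operator.attrgetter. The sketch below uses a SimpleNamespace as a hypothetical stand-in for a requests response (so it runs without network access); the text and status_code values are made up.

```python
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical stand-in for the object returned by requests.get(...)
response = SimpleNamespace(text="<!DOCTYPE html>", status_code=200)

# Attribute access written as a function call, similar in spirit to (. response text)
via_getattr = getattr(response, "text")

# attrgetter builds a reusable "give me .text" function
get_text = attrgetter("text")
via_attrgetter = get_text(response)

print(via_getattr)
print(via_getattr == via_attrgetter)
```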

Chaining functions

In Python, if you don’t want to save every single step while you apply transformations to a variable, you have two options.

If you are dealing with functions, you can nest them.

var1 = x
var2 = func1(var1)
var = func2(var2)

# becomes
var = func2(func1(x))

If you are dealing with methods, you can chain them.

var1 = x
var2 = var1.method1()
var = var2.method2()

# becomes
var = x.method1().method2()
# also works across multiple lines if you wrap the expression in parentheses
var = (x.method1()
        .method2())

Unlike nesting, chaining has the advantage of letting you read code in the execution order (deeply nested expressions are hard to read). However, chaining only works with methods and each method must return the type of object on which the next method is defined.

# This works!
("blabla  ".upper()  # .upper is defined on str and returns str
           .strip()) # .strip is defined on str and returns str
# => "BLABLA"

# This doesn't work!
([1,3,2].sort()      # .sort is defined on list and returns None
        .append(4))  # .append is defined on list, not None
# => AttributeError: 
# => 'NoneType' object has no attribute 'append'
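One way around the failing chain, as a sketch rather than the only option: sorted returns a new list instead of None, so its result can keep flowing into the next operation.

```python
numbers = [1, 3, 2]

# sorted() returns a new list, so the expression can continue;
# list.sort() sorts in place and returns None, which ends the chain.
result = sorted(numbers) + [4]
in_place_return = numbers.sort()

print(result)
print(in_place_return)
```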

One of the nicest patterns in Lisp is the threading macro (->).

Threading macro (->) is used to avoid nesting of expressions. The threading macro inserts each expression into the next expression’s first argument place.

When using the threading macro, each expression gets piped into the next expression as its first argument (you don’t even have to write it), leading to code that can be read in its evaluation order.

This can replace nesting,

func2(func1(x))

# becomes
(-> x
  (func1)
  (func2))

chaining,

(x.method1()
  .method2())

# becomes

(-> x
  (.method1)
  (.method2))

or a mix of both! Note that you don’t write the first argument, so you can start directly with the second one.

func1(x.method1(), <arg2-name>=<val-arg2>)

# becomes

(-> x
  (.method1)
  (func1 :arg2-name <val-arg2>))
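To make the "first argument" rule concrete, here is a rough Python imitation of the threading macro. thread_first is a hypothetical helper written for this post, not something Hy or the standard library provides, and it works at run time on functions whereas the real macro rewrites code.

```python
from functools import reduce


def thread_first(value, *steps):
    """Pipe value through steps, inserting it as each step's first argument.

    A step is either a callable or a tuple (callable, extra_arg, ...).
    """
    def apply(acc, step):
        if callable(step):
            return step(acc)
        func, *extra_args = step
        return func(acc, *extra_args)

    return reduce(apply, steps, value)


# Reads top to bottom, like (-> x (.strip) (.split "-") (len))
result = thread_first(
    "  my-string-to-split  ",
    str.strip,
    (str.split, "-"),
    len,
)
print(result)
```

Each step only lists the arguments after the first one, just as in the Hy examples above.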

Lambda function

Anonymous functions in Hy are written with fn (the equivalent of Python’s lambda), as shown below.

lambda x: 2017 - x.year_of_birth

becomes

(fn [x] (- 2017 x.year_of_birth))

Pandas pipeline

Pandas is a vast library. The pipeline below uses only what I consider to be a subset of common operations on dataframes.

To try out a nice range of functions, we will do some useless operations, like renaming columns before dropping them.

We will use a table listing the elected candidates of the 2015 Swiss National Council and try to get the average age of the candidates for the French-speaking “cantons” (political districts).

This table’s HTML is poorly formatted: although the column headers get special styling, they are not <th> cells but plain <td> cells in an ordinary <tr>, just like the data rows.

<h2>Les candidats élus</h2>
<table class="nrwtab2">
<tr>
<td width="200px" bgcolor="#414141" align="left"><font color="#FFFFFF">Nom</font></td>
<td width="50px" bgcolor="#414141" align="center"><font color="#FFFFFF">Né</font></td>
<td width="500px" bgcolor="#414141" align="left"><font color="#FFFFFF">Liste</font></td>
<td width="50px" bgcolor="#414141" align="center"><font color="#FFFFFF">Canton</font></td>
</tr>
<tr>
<td width="200px" bgcolor="#FFFFFF" align="left">Addor Jean-Luc</td>
<td width="50px" bgcolor="#FFFFFF" align="center">1964</td>
<td width="500px" bgcolor="#FFFFFF" align="left">UDC Valais Central</td>
<td width="50px" bgcolor="#FFFFFF" align="center">VS</td>
</tr>
...
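To see what "the header is just another row" means in practice, here is a small standard-library sketch that collects the <td> cells row by row and then treats the first row as the header. This is not how pandas parses tables internally; RowCollector is a hypothetical helper, and the markup is a trimmed copy of the excerpt above.

```python
from html.parser import HTMLParser

SAMPLE = """
<table class="nrwtab2">
<tr>
<td><font color="#FFFFFF">Nom</font></td>
<td><font color="#FFFFFF">Né</font></td>
<td><font color="#FFFFFF">Liste</font></td>
<td><font color="#FFFFFF">Canton</font></td>
</tr>
<tr>
<td>Addor Jean-Luc</td>
<td>1964</td>
<td>UDC Valais Central</td>
<td>VS</td>
</tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = RowCollector()
parser.feed(SAMPLE)

# Nothing in the markup marks row 0 as special, so we decide it is the header.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(header)
print(records)
```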

So, when using read_html, we need to specify that the headers are stored in the first row (at index 0). In Python, the pipeline should look similar to the one below:

import pandas as pd
import requests

def get_avg_age_of_elected():
    url = 'https://www.admin.ch/ch/f/pore/nrw15/list/ch_elu.html'
    # Scrape the HTML from the URL
    html = requests.get(url).text
    # Get a list of html tables converted to dataframe
    dfs = pd.read_html(html, header=0) 
    # Get first dataframe in list 
    df = dfs[0]  
    # Rename columns to lowercase
    df = df.rename(columns=str.lower) 
    # Translate column names to english
    df = df.rename(columns={'nom': 'name', 
                            'né': 'year_of_birth', 
                            'liste': 'political_group'})
    # Drop not-needed columns
    df = df.drop(['name', 'political_group'], axis=1) 
    # Calculate an age column based on year_of_birth
    df = df.assign(age=(lambda x: (2017 - x.year_of_birth)))
    # Filter columns to keep only 'canton' and 'age'
    df = df.filter(items=['canton', 'age'], axis=1)
    # Group by cantons
    df = df.groupby(['canton'])
    # Calculate the mean of the remaining column ('age')
    # on the grouped-by-canton df
    df = df.mean()
    # Filter rows to keep only French-speaking cantons
    # (after our groupby, 'canton' has become the row index)
    df = df.filter(regex='GE|VS|VD|FR|NE|JU', axis=0)
    return df

This is quite long-winded. It would be more elegant to use method-chaining.

def get_avg_age_of_elected():
    url = 'https://www.admin.ch/ch/f/pore/nrw15/list/ch_elu.html'
    html = requests.get(url).text
    dfs = pd.read_html(html, header=0) 
    df = (dfs[0]  
           .rename(columns=str.lower)
           .rename(columns={'nom': 'name', 
                            'né': 'year_of_birth', 
                            'liste': 'political_group'}) 
           .drop(['name', 'political_group'], axis=1)
           .assign(age=(lambda x: (2017 - x.year_of_birth)))
           .filter(items=['canton', 'age'])
           .groupby(['canton'])
           .mean()
           .filter(regex='GE|VS|VD|FR|NE|JU', axis=0))
    return df
    
get_avg_age_of_elected()

# =>            age
# canton           
# FR      56.714286
# GE      51.000000
# JU      62.500000
# NE      55.500000
# VD      51.833333
# VS      44.750000
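If the groupby, mean and filter steps feel opaque, the same computation can be spelled out in plain Python. This is only a stand-in to show the logic, not what pandas does internally. Only the first row below comes from the table excerpt; the other rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# First row is from the table excerpt; the rest are hypothetical.
rows = [
    {"canton": "VS", "year_of_birth": 1964},
    {"canton": "VS", "year_of_birth": 1980},
    {"canton": "GE", "year_of_birth": 1970},
    {"canton": "ZH", "year_of_birth": 1955},
]

FRENCH_SPEAKING = {"GE", "VS", "VD", "FR", "NE", "JU"}

# "assign" + "groupby": compute a rough age and bucket it by canton
ages_by_canton = defaultdict(list)
for row in rows:
    ages_by_canton[row["canton"]].append(2017 - row["year_of_birth"])

# "mean" + "filter": average each bucket, keep French-speaking cantons only
average_age = {
    canton: mean(ages)
    for canton, ages in sorted(ages_by_canton.items())
    if canton in FRENCH_SPEAKING
}
print(average_age)
```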

Hy pipeline

A strict translation to Hy could look like:

(import [pandas :as pd])
(import requests)

(defn get_avg_age_of_elected []
  (setv url "https://www.admin.ch/ch/f/pore/nrw15/list/ch_elu.html")
  (setv html (. (requests.get url) text))
  (setv dfs (pd.read_html html :header 0))
  (-> dfs
    (.        [0])
    (.rename  :columns str.lower) 
    (.rename  :columns {"nom" "name" 
                        "né" "year_of_birth"
                        "liste" "political_group"})
    (.drop    ["name" "political_group"] :axis 1)
    (.assign  :age (fn [x] (- 2017 x.year_of_birth)))
    (.filter  :items ["canton" "age"])
    (.groupby ["canton"])
    (.mean)
    (.filter  :regex "GE|VS|VD|FR|NE|JU" :axis 0)))

Let’s leverage some of Hy’s strengths. Firstly, we can fold the first 3 variables into the threading macro, without needing nesting. There is also a first function that returns the first element of a collection and can replace (. [0]).

(import [pandas :as pd])
(import requests)

(defn get_avg_age_of_elected []
  (-> "https://www.admin.ch/ch/f/pore/nrw15/list/ch_elu.html"
    (requests.get)
    (. text)
    (pd.read_html :header 0)
    (first)
    (.rename      :columns str.lower) 
    (.rename      :columns {"nom" "name" 
                            "né" "year_of_birth"
                            "liste" "political_group"})
    (.drop        ["name" "political_group"] :axis 1)
    (.assign      :age (fn [x] (- 2017 x.year_of_birth)))
    (.filter      :items ["canton" "age"])
    (.groupby     ["canton"])
    (.mean)
    (.filter      :regex "GE|VS|VD|FR|NE|JU" :axis 0)))

Secondly, fn functions (the lambda equivalent) can span multiple lines and have docstrings! Since we aren’t saving the function, the docstring isn’t very useful for help, but it makes a nice alternative to comments.

(import [pandas :as pd])
(import requests)

(defn get_avg_age_of_elected []
  (-> "https://www.admin.ch/ch/f/pore/nrw15/list/ch_elu.html"
    (requests.get)
    (. text)
    (pd.read_html :header 0)
    (first)
    (.rename      :columns str.lower) 
    (.rename      :columns {"nom" "name" 
                            "né" "year_of_birth"
                            "liste" "political_group"})
    (.drop        ["name" "political_group"] :axis 1)
    (.assign      :age (fn [x] 
                        "Calculate rough age"
                        (- 2017 x.year_of_birth)))
    (.filter      :items ["canton" "age"])
    (.groupby     ["canton"])
    (.mean)
    (.filter      :regex "GE|VS|VD|FR|NE|JU" :axis 0)))

Last but not least, some patterns in pandas would break a nice chain of methods. For example, we used .filter to filter rows. This only works on index labels, so we applied the method after df.groupby (which made the canton column the row index). A more common way to filter rows is to use .loc. But .loc isn’t a method like .filter; it’s an attribute that takes an indexer in square brackets. That’s easy to miss. I definitely did, and I wasn’t the only one: as of today, the only question about Hy and pandas on Stack Overflow is about this.

The other tricky part is that .loc takes an indexer as input, and this indexer often refers to the dataframe being filtered. In vanilla Python, we would typically have to break our chain to be able to use the dataframe resulting from the first 3 methods in our indexer.

# This won't work
df = (df.method1()
        .method2()
        .method3()
        .loc[df.colx == y] # The indexer refers to the original df
        .method4()
        .method5())

# This would work
df1 = (df.method1()
        .method2()
        .method3())

df2 = df1.loc[df1.colx == y]

df3 = (df2.method4()
          .method5())

When we applied (. text) to the response from requests.get, we saw that we can access attributes in the threading macro. However, the threading macro only puts the result of the previous expression in first-argument position, so how can we reuse it in the indexer? Luckily, there is an alternative to (->): (as->). (as->) is similar to (->), but after the first expression you specify a name for the result being passed around (e.g. it). It requires a bit more typing, because you now have to write that argument (it) explicitly in each expression. But since the result now has a name, you can use it several times in the same expression.

(import [pandas :as pd])
(import requests)

(defn get_avg_age_of_elected []
  (as-> "https://www.admin.ch/ch/f/pore/nrw15/list/ch_elu.html" it
    (requests.get it)
    (. it text)
    (pd.read_html it :header 0)
    (first it)
    (.rename it      :columns str.lower) 
    (.rename it      :columns {"nom" "name" 
                               "né" "year_of_birth"
                               "liste" "political_group"})
    (.drop it        ["name" "political_group"] :axis 1)
    (. it loc [(.isin it.canton ["GE" "VS" "VD" "FR" "NE" "JU"])])
    (.assign it      :age (fn [x] 
                          "Calculate rough age"
                          (- 2017 x.year_of_birth)))
    (.filter it      :items ["canton" "age"])
    (.groupby it    ["canton"])
    (.mean it)))
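The benefit of naming the piped value can be imitated in Python too. thread_as below is a hypothetical helper written for this post (Hy’s as-> is a macro that rewrites code, whereas this works on functions at run time). The point is the second step, where it appears twice. The first birth year comes from the table excerpt; the other two are made up.

```python
from functools import reduce


def thread_as(value, *steps):
    """Pipe value through steps; each step is a one-argument function of `it`."""
    return reduce(lambda it, step: step(it), steps, value)


result = thread_as(
    [1964, 1952, 1979],
    # Turn birth years into rough ages
    lambda it: [2017 - year for year in it],
    # `it` is used twice: once to iterate, once to compute the average
    lambda it: [age for age in it if age > sum(it) / len(it)],
)
print(result)
```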

Let’s use hy2py to check the converted Python code. Note that if you use a docstring in your fn, it will be converted to a regular function (def) rather than a lambda.

# Without docstring in fn
import pandas as pd
import requests

def get_avg_age_of_elected():
    it = 'https://www.admin.ch/ch/f/pore/nrw15/list/ch_elu.html'
    it = requests.get(it)
    it = it.text
    it = pd.read_html(it, header=0)
    it = it[0]
    it = it.rename(columns=str.lower)
    it = it.rename(columns={'nom': 'name', 'né': 'year_of_birth', })
    it = it.drop(['name', 'liste'], axis=1)
    it = it.assign(age=(lambda x: (2017 - x.year_of_birth)))
    it = it.filter(items=['canton', 'age'])
    it = it.loc[it.canton.isin(['GE', 'VS', 'VD', 'FR', 'NE', 'JU'])]
    it = it.groupby(['canton'])
    it = it.mean()
    return it

# With docstring in fn
import pandas as pd
import requests

def get_avg_age_of_elected():
    it = 'https://www.admin.ch/ch/f/pore/nrw15/list/ch_elu.html'
    it = requests.get(it)
    it = it.text
    it = pd.read_html(it, header=0)
    it = it[0]
    it = it.rename(columns=str.lower)
    it = it.rename(columns={'nom': 'name', 'né': 'year_of_birth', })
    it = it.drop(['name', 'liste'], axis=1)

    def _hy_anon_fn_1(x):
        'Calculate rough age'
        return (2017 - x.year_of_birth)
    it = it.assign(age=_hy_anon_fn_1)
    it = it.filter(items=['canton', 'age'])
    it = it.loc[it.canton.isin(['GE', 'VS', 'VD', 'FR', 'NE', 'JU'])]
    it = it.groupby(['canton'])
    it = it.mean()
    return it
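The switch from lambda to def makes sense once you remember that a Python lambda body is a single expression, so it has no place to hold a docstring. A quick check (the function names here are just for illustration):

```python
# A lambda has no docstring slot
rough_age = lambda year_of_birth: 2017 - year_of_birth


# A def can carry one as its first statement
def rough_age_documented(year_of_birth):
    'Calculate rough age'
    return 2017 - year_of_birth


print(rough_age.__doc__)
print(rough_age_documented.__doc__)
print(rough_age(1964) == rough_age_documented(1964))
```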