Tidying Up Pandas


For those using the pandas module, the first thing you quickly come to realise is that there is often more than one way to do just about everything.

The purpose of this article is to demonstrate how we can limit this to an extent by drawing inspiration from R’s dplyr and tidyverse libraries.

Coming from academia, R was often the go-to lingua franca for those around me (Computational Biology/Bioinformatics or Statistics). And like them, I was hooked on the famous tidyverse meta-package, which includes dplyr (previously plyr, as in "ply(e)r"), lubridate (time series) and tidyr.

P.S. As I am writing this article, I realised it isn't just tidyverse, but the whole R ecosystem which I've come to love whilst doing metagenomics and computational biology in general.

For the benefit of those who started from R, pandas is Python's dataframe module. Other packages like H2O.ai's datatable exist, and it is heavily inspired by R's own data.table library.

In his talk, Hadley Wickham (creator of the dplyr package) mentioned that what we really need for table manipulation is just a handful of functions:

  • filter
  • select
  • arrange
  • mutate
  • group_by
  • summarise
  • merge

However, I would argue you need a bit more than the above.

For example, knowing R's family of apply functions will help greatly. Or a couple of summary statistics functions like summary or str, although nowadays I mostly use skimr::skim for summaries.

skim(iris) 
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts ordered
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0 FALSE
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5 ▇▁▁▅▃▃▂▂
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4 ▁▂▅▇▃▂▁▁

In fact, the output of skimr looks really similar to Google's Facets. As you can see, both print basic distribution stats and histograms.
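On the pandas side there is no exact built-in equivalent of skim, but describe() and value_counts() get you part of the way there (a minimal sketch, using the iris DataFrame we load in the next section):

# python
iris.describe()              # count, mean, std and quartiles for the numeric columns
iris.species.value_counts()  # per-level counts for the categorical column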


Thus, in this post I'll try my best to demonstrate 1-to-1 mappings of the tidyverse vocabulary to pandas DataFrame methods.

For demonstration, we will be using the famous Iris flower dataset.

# python 
import seaborn as sns
iris = sns.load_dataset("iris")
# R
iris

I've chosen to import the iris dataset using seaborn rather than sklearn's datasets.load_iris, because the latter imports the dataset as numpy arrays.

One of the first things I usually do after importing a table is to run the str function on it (the pandas equivalent is the info method) and check whether there are any NAs.

# R 
str(iris)
>
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# python 
iris.info(null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length 150 non-null float64
sepal_width 150 non-null float64
petal_length 150 non-null float64
petal_width 150 non-null float64
species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
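If all you want is the missing-value counts per column, a quick alternative I often use is:

# python
iris.isna().sum()  # number of NAs in each column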

Filter

Now, onto the next function. The pandas method closest to R's filter is DataFrame.query. In the examples below, cutoff has been defined as a variable beforehand.

# R
iris %>% filter(Sepal.Width > cutoff)
# python
## Method 1: 
iris[iris.sepal_width > cutoff]
## Method 2:
iris.query("sepal_width > @cutoff")

There are two ways to do this in Python. The first is probably what you'll find most Python users using.

However, pd.DataFrame.query() maps more closely with dplyr::filter() .

One downside of using query: linters which follow the PEP 8 convention, such as flake8, will complain that the cutoff variable is unused even though it is referenced in "sepal_width > @cutoff". This is because linters cannot recognise the use of cutoff inside the quoted query string.
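One workaround, assuming flake8 (where F841 is its "assigned but never used" warning), is to silence that specific check on the offending line; wide_sepals below is just an illustrative helper:

# python
def wide_sepals(iris):
    cutoff = 3.0  # noqa: F841 -- only referenced inside the query string
    return iris.query("sepal_width > @cutoff")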

Surprisingly, filter makes a return in PySpark. :)

# python (pyspark)
type(flights)
# pyspark.sql.dataframe.DataFrame

# filters flights which are > 1000 miles long
flights.filter('distance > 1000')

Select

This is reminiscent of SQL’s select keyword which allows you to choose columns.

# R 
iris %>% select(Sepal.Width, Sepal.Length)
# Python 
iris.loc[:, ["sepal_width", "sepal_length"]]

Initially, I thought the df[['col1', 'col2']] pattern would be a good map, but quickly realised we cannot take slices of the columns with it the way select allows.

# R 
iris %>% select(Sepal.Length:Petal.Width)
# Python  
iris.loc[:, "sepal_length":"petal_width"]

A thing to note about the loc method is that it returns a Series instead of a DataFrame when the selection is just a single row, so you'll have to pass the row label inside a list in order to get back a DataFrame.

# Python
iris.loc[1, :] # returns a Series 
iris.loc[[1],:] # returns a dataframe

But the really awesome thing about select is its ability to unselect columns, which is missing from the loc method.

Instead, you have to use the .drop() method.

# Python
df.drop(columns=["col1"])

Note that I had to add the columns parameter because drop is not only used to drop columns; it can also drop rows based on their index.
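A quick illustration of both uses:

# Python
iris.drop(columns=["species"])  # drop a column
iris.drop(index=[0, 1])         # drop rows by their index labels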

Like filter, select is also used in PySpark!

# python (pySpark)
# shows the column xyz of the spark dataframe.
df.select("xyz").show()
# alternative  
df.select(df.xyz)

Arrange

The arrange function lets one sort the table by a particular column.

# R 
df %>% arrange(desc(col1))
# Python
df.sort_values(by="col1", ascending=False) 
# everything is reversed in python fml.

Mutate

dplyr ’s mutate was really an upgrade from R’s apply .

NOTE: Other applies which are useful in R include, for example, mapply and lapply.

# R 
df %>% mutate(new = something / col2, newcol = col + 1)
# Python 
df.assign(new = lambda x: x.something / x.col2, 
          newcol = lambda x: x.col + 1)

tidyverse ’s mutate function by default takes the whole column and does vectorised operations on it. If you want to apply the function row by row, you’ll have to couple rowwise with mutate .

# R
# my_function does not take vectorised input of the entire column,
# so this will fail:
iris %>% 
    mutate(new_column = my_function(Sepal.Width, Sepal.Length))
# coupling rowwise with mutate applies it row by row instead:
iris %>% 
    rowwise %>% 
    mutate(new_column = my_function(Sepal.Width, Sepal.Length))

To achieve the same using the .assign method, you should nest an apply inside the function.

# Python
import re

def do_something_string(col): 
    if re.search(r".*(osa)$", col):
        value = "is_setosa"
    else:
        value = "not_setosa"
    return value

iris = iris.assign(transformed_species = lambda df: df["species"].apply(do_something_string))

If you’re lazy, you could just chain two anonymous functions together.

# Python
iris = iris.assign( transformed_species = lambda df: df.species.apply(do_something_string))

Apply

From R’s apply help docs:

apply(X, MARGIN, FUN, ...)

Where the value of MARGIN is either 1 or 2 for (rows, columns), i.e. if you want to apply a function to each row, you set MARGIN to 1.

However, in pandas, axis refers to which values (the index i or the columns j) will be used as the index of the applied function's input parameter.

Axis 0 refers to the DataFrame's index and axis 1 refers to the columns.
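A toy example (the two-column DataFrame here is my own) makes the difference concrete:

# Python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
# axis=0: the function receives each column as a Series (indexed by the row labels)
df.apply(lambda col: col.sum(), axis=0)  # a -> 3, b -> 30
# axis=1: the function receives each row as a Series (indexed by the column names)
df.apply(lambda row: row["a"] + row["b"], axis=1)  # 11, 22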


So if you wanted to carry out row-wise operations in R, you would set MARGIN to 1.

# R
df %>% apply(1, function(row){ ... do some compute ... })

I rarely do that now, though, since plyr and later dplyr came along.

However, there is no plyr in pandas, so we have to go back to using apply for row-wise operations, and here the axis argument is 1. I initially found this very confusing. The reason is that each row is really just a pandas.Series whose index is the parent pandas.DataFrame's columns. Thus, in this example, axis is referring to which axis to set as the index of that Series.

# python 
iris.apply(lambda row: do_something(row), axis=1)

Interestingly, a pattern which I do not use in R is to apply a function over a single column, in this case a pandas.Series object.

# python 
iris.sepal_width.apply(lambda x: x**2)
# if you want a fancy progress bar, you could use tqdm
from tqdm import tqdm; tqdm.pandas()
iris.sepal_width.progress_apply(lambda x: x**2)
# if you need a parallel apply (this works with dask underneath)
import swifter
iris.sepal_width.swifter.apply(lambda x: x**2)

In R, one of the common idioms which I keep going back to for a parallel version of groupby is as follows:

# R
unique_list %>% 
lapply(function(x){ 
    ... 
    df %>% filter(col == x) %>% do_something() # do something to the subset 
    ... 
}) %>% do.call(rbind,.)

If you want a parallel version you’ll just have to change the lapply to mclapply .

Additionally, there’s mclapply from the parallel / snow library in R.

# R
ncores = 10 # the number of cores 
unique_list %>% 
    mclapply(function(x){ 
    ... 
    df %>% filter(col == x) %>% do_something() # do something to the subset 
    ... 
}, mc.cores=ncores) %>% 
do.call(rbind,.)

Separately, you can split the whole table into partitions and do the manipulations on them in parallel, much like pySpark does; with dask this looks like the following:

# Python (dask)
import dask.dataframe as dd

(
    dd.from_pandas(my_df, npartitions=nCores)
        .map_partitions(lambda df: df.apply(lambda x: nearest_street(x.lat, x.lon), axis=1))
        .compute(scheduler="processes")
)

To achieve the same, we could use dask directly as above, or an even higher level wrapper in the form of the swifter module.

# Python
# you can easily speed up the example by adding the `swifter` accessor before `.apply`
series.swifter.apply(lambda x: x**2)

Group by

The .groupby method in pandas is equivalent to the R function dplyr::group_by, returning a DataFrameGroupBy object.

In the tidyverse there's the ungroup function to ungroup grouped DataFrames; to achieve the same in pandas, there is no 1-to-1 mappable function.

One way is to complete the groupby -> apply two-step process, feeding apply with an identity function, apply(lambda x: x), as sketched below.
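A minimal sketch of that idea:

# Python
grouped = iris.groupby("species")       # DataFrameGroupBy object
ungrouped = grouped.apply(lambda x: x)  # back to a plain DataFrame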

Summarise

In pandas, the equivalent of the summarise function is aggregate, abbreviated as the agg function. You will have to couple this with groupby, so again it's a two-step groupby -> agg transformation.

# R
r_mt = mtcars %>% 
    mutate(model = rownames(mtcars)) %>% 
    select(cyl, model, hp, drat) %>% 
    filter(cyl < 8) %>% 
    group_by(cyl) %>% 
    summarise( 
        hp_mean = mean(hp),
        drat_mean = mean(drat), 
        drat_std = sd(drat), 
        diff = max(drat) - min(drat) 
    ) %>% 
    arrange(drat_mean) %>% 
    as.data.frame

The same series of transformations written in Python would be as follows:

# Python
def transform1(x): 
    return max(x) - min(x)

def transform2(x): 
    return max(x) + 5

py_mt = ( 
    mtcars
        .loc[:, ["cyl", "model", "hp", "drat"]]  # select
        .query("cyl < 8")                        # filter
        .groupby("cyl")                          # group_by
        .agg({                                   # summarise; agg is an abbreviation of aggregate
            'hp': 'mean', 
            'drat': ['mean', 'std', transform1, transform2]  # hmmmmmm
        })
        .sort_values(by=[("drat", "mean")])      # multi-index sort (unique to pandas)
)
py_mt

Merge

Natively, R supports the merge function, and similarly in pandas there's the pd.merge function.

# R 
df1 %>% merge(df2, by="common_column")
# Python
df1.merge(df2, how="inner", on="common_column")

Besides the merge function, there's also the join method. There exist other joins in dplyr, for example left_join, right_join, inner_join and anti_join.
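A rough mapping of dplyr's join verbs onto merge's how argument (df1, df2 and common_column are placeholders):

# Python
df1.merge(df2, how="left", on="common_column")   # ~ left_join
df1.merge(df2, how="right", on="common_column")  # ~ right_join
df1.merge(df2, how="inner", on="common_column")  # ~ inner_join
df1.merge(df2, how="outer", on="common_column")  # ~ full_join
# anti_join has no direct `how` value; one common workaround:
(
    df1.merge(df2, how="left", on="common_column", indicator=True)
       .query('_merge == "left_only"')
       .drop(columns="_merge")
)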

Inplace

In R there's the compound assignment pipe operator %<>%, which is similar to the inplace=True argument found in some pandas methods, but not all. :( Apparently pandas is going to remove inplace altogether…
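A small sketch of the contrast:

# Python
iris.sort_values("sepal_width", inplace=True)  # mutates iris in place and returns None
iris = iris.sort_values("sepal_width")         # returns a new DataFrame you reassign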

Debugging

In R, we have the browser() function.

# R 
unique(iris$Species) %>% 
    lapply(function(s){ 
        browser() 
        iris %>% filter(Species == s) 
        ....
     })

It’ll let you step into the function which is extremely useful if you want to do some debugging.

In Python, there’s the set_trace function.

# Python
from IPython.core.debugger import set_trace
(
    iris
        .groupby("species") 
        .apply(lambda groupedDF: set_trace()) 
)

Last but not least, if you really need to use some R function, you could always rely on the rpy2 package. For me, I rely on this a lot for plotting. ggplot2 ftw!

# python
import rpy2            # imports the library 
%load_ext rpy2.ipython # load the magic

Sometimes there are issues installing R packages from within R. Instead, you can run:

conda install -c r r-tidyverse r-ggplot2

Thereafter, you can use R and Python interchangeably in the same Jupyter notebook.

%%R -i python_df -o transformed_df
transformed_df = python_df %>% 
    select(-some_columns) %>% 
    mutate(newcol = somecol * 2)

NOTE: %%R is cell magic and %R is line magic.

If you need outputs to be printed like a normal pandas DataFrame, you can use the single-percent line magic:

%R some_dataFrame %>% skim

Ellipsis

In R, one nifty trick you can do is to pass arguments to inner functions without ever having to define them in the outer function's signature.

# R 
#' Simple function which takes two parameters `one` and `two` and an ellipsis `...`
somefunction = function(one, two, ...){ 
     three = one + two 
     sometwo = function(x, four){ 
         x + four 
     } 
     sometwo(three, ...) # four exists within the ellipsis  
} 
# because of the ellipsis, we can pass as many parameters as we want; the extras will be stored in the ellipsis
somefunction(one=2, two=3, four=5, name="wesley")

In Python, **kwargs takes the place of ... . Below is an explanation of how exactly it works.

Explanation

Firstly, the double asterisk ** is called the unpacking operator (it's placed before a parameter name, e.g. kwargs, so together it looks like **kwargs).

The convention is to name that variable kwargs (which stands for keyworded arguments), but it could be named anything.

Most articles which describe the unpacking operator start off with this explanation: a dictionary is used to pass a function its parameters.

# Python
adictionary = { 'first' : 1, 'second': 2 } 
def some_function(first, second): 
    return first + second 
some_function(**adictionary) # which gives 3

But you could also twist this around and use **kwargs as a parameter in the function definition. Doing this lets you pass an arbitrary number of keyword arguments when calling the function.

The keyword-value pairs are wrapped into a dictionary named kwargs which is accessible inside the function.

# Python 
# dummy function which prints `kwargs`
def some_function (**kwargs): 
    print(kwargs) 
some_function(first=1, second=2)

The previous two cases are not exclusive; you could actually ~mix~ them together, i.e. have named parameters as well as **kwargs.

# Python 
adictionary = { 'first': 1, 'second': 2, 'useless_value': "wesley" } 
def some_function(first, second, **kwargs): 
    print(kwargs)
    return first + second

print(some_function(**adictionary))

The output will be {'useless_value': 'wesley'}, followed by the returned sum, 3.

This allows a Python function to accept as many keyword arguments as you supply it. Those which are already named in the function definition are bound directly, and those which are not can be accessed through kwargs.

By putting **kwargs as an argument in the inner function call, you're basically unpacking the dictionary into that function's parameters.

# Python
def somefunction(one, two, **kwargs): 
    print(f"outer function kwargs:\n\t{kwargs}") 
    three = one + two 
    def sometwo(x, four, **kwargs): 
        print(f"inner function kwargs:\n\t{kwargs}") 
        return x + four
    return sometwo(three, **kwargs)  # `four` is picked out of kwargs

somefunction(one=2, two=3, four=5, name="wesley")
# outer function kwargs:
#     {'four': 5, 'name': 'wesley'}
# inner function kwargs:
#     {'name': 'wesley'}
# 10

Let's now compare this with the original R ellipsis:

# R 
#' Simple function which takes two parameters `one` and `two` and an ellipsis `...`
somefunction = function(one, two, ...){ 
    three = one + two 
    sometwo = function(x, four){ 
        x + four 
    } 
    sometwo(three, ...) # four exists within the ellipsis  
} 
# because of the ellipsis, we can pass as many parameters as we want; the extras will be stored in the ellipsis
somefunction(one=2, two=3, four=5, name="wesley")

Conclusion

There are many ways to do things in pandas, more so than the tidyverse way, and I wish this were clearer.

Additionally, something which caught me off guard after coming to Honestbee was the amount of SQL I needed.

For example, PostgreSQL to query RDS and its dialect for querying Redshift, KSQL for querying data streams via Kafka, and Athena's query language, built on top of Presto, for querying S3, where most of the data used to exist in Parquet files.

This shows one big deviation from academia: in a company, data is usually stored in a database / data lake / data stream, whereas in academia it's usually just one big flat data file.

We've come to the end of this attempt at mapping tidyverse vocabulary to pandas. Hope you've found this informative and useful! See you soon!