Google Summer of Code and Pander - Part I
For the last 2 years, I have been lucky enough to be selected for and participate in Google Summer of Code with the R Project, working on different aspects of the pander package. A review of my experience and the work done is getting a bit overdue, so I decided to do a 2-part write-up about the original plans, the results achieved and the experience gained from participating in GSoC. I hope these 2 articles serve as a reasonable example, for people who are considering participating, of one of the possible GSoC projects.
What is pander?
pander is an R package, mainly developed and maintained by Gergely Daroczi, that provides helper functions and a generic S3 method to render various types of R objects into pandoc's markdown format, which can then be converted to HTML, PDF, docx, odt and other document formats that pandoc supports. A quick example of the power of pander:
> data(warpbreaks, package = "datasets")
> ct <- CrossTable(warpbreaks$wool, warpbreaks$tension,
+ dnn = c("Wool", "Tension"))
> pander(ct, plain.ascii = TRUE)
------------------------------------------------------------
               Tension
    Wool          L           M           H         Total
------------ ----------- ----------- ----------- -----------
     A
     N            9           9           9          27
 Chi-square       0           0           0
   Row(%)     33.3333%    33.3333%    33.3333%    50.0000%
 Column(%)       50%         50%         50%
  Total(%)    16.6667%    16.6667%    16.6667%
     B
     N            9           9           9          27
 Chi-square       0           0           0
   Row(%)     33.3333%    33.3333%    33.3333%    50.0000%
 Column(%)       50%         50%         50%
  Total(%)    16.6667%    16.6667%    16.6667%
   Total         18          18          18          54
              33.3333%    33.3333%    33.3333%
------------------------------------------------------------
The package can be used as a standalone tool for literate programming, building on the traditions of brew, or inside knitr and other markdown-based tools to render nice tables and other textual forms. It provides a wide variety of options for rendering markdown and supports 67 different R classes.
Why pander?
Prior to GSoC 2014 I had some reasonable exposure to R programming. During my undergraduate studies at Kyiv Polytechnic Institute, I took some online courses about data science that used R and even did a project on classifying spam using Naive Bayes and SVM (which was a rather easy task to do in R). Moreover, when I started my graduate studies at Purdue University, I got involved with the S3 lab and fastR. Taking all this into consideration, I was looking for something that would (a) use my previous experience with R and (b) let me see R from a different angle, so GSoC and pander looked like a great fit.
My plan for GSoC 2014
While I worked on many small things here and there, besides documentation and testing, the main goals stated in my proposal for the summer were:
- extend generic S3 method with new classes
- absolute and relative column width, hyphenation with koRpus
- optimize pandoc.table
My full application is available here.
Adding new S3 methods
First, I started by extending the generic S3 method. Here is the list of methods that I implemented:
- pander.CrossTable
- pander.ts
- pander.formula
- pander.zoo
- pander.lme
- pander.coxph
- pander.survdiff
- pander.mtable
- pander.survfit
- pander.stat.table
- pander.smooth.spline
- pander.clogit
- pander.rlm
- pander.microbenchmark
This was a good way for me to get up to speed with pander. Typically, pander.* methods were based on the respective print.* methods, with some modifications before calling pandoc.table. However, some of the methods, for example pander.CrossTable and pander.mtable, required a complete rewrite to fit naturally with how a table can be represented in markdown. Generally, most complications came from the lack of support for row- and column-spanning cells in markdown, or from multi-line cells.
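To make that pattern concrete, here is a minimal sketch of what such a method can look like. The score_summary class and its constructor are made up for this illustration and are not part of pander; only pander() and pandoc.table() are real pander functions, and the actual methods listed above are more involved.

```r
library(pander)

# Hypothetical class used only for illustration; not part of pander.
new_score_summary <- function(scores) {
  structure(list(scores = scores), class = "score_summary")
}

# The print method shows the object in the usual console style.
print.score_summary <- function(x, ...) {
  print(summary(x$scores))
  invisible(x)
}

# The pander method reshapes the same information into a data.frame
# and hands it to pandoc.table, which produces the markdown table.
pander.score_summary <- function(x, caption = "Score summary", ...) {
  s <- summary(x$scores)
  tab <- data.frame(Statistic = names(s), Value = as.numeric(s))
  pandoc.table(tab, caption = caption, ...)
}

# S3 dispatch on the pander generic picks up the new method.
pander(new_score_summary(c(3, 5, 8, 13)))
```

Most of the work in a real method is the middle step: deciding how the object's pieces map onto the rows and columns of a flat table.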
Adding configurable column width
After getting more familiar with pander, my next step was to implement configurable width support. Prior to that, pander supported a single value for all the columns, which was a bit limiting in some cases. Roughly, the work was divided into 3 parts:
1) Allowing the max width of each column to be specified separately - Pull request #90.
The implementation was based on the following principles:
- If split.cells is just a number, behavior is the same as before.
- The first value in split.cells goes to the row names; if there are no row names, it goes to the first column.
- If length(split.cells) == number of columns, then there is one max width for every column, and the default value is used for the row names (if they exist).
- If length(split.cells) == number of columns + 1, then the first value in split.cells is for the row names, and the others are for the columns.
- All extra values in the split.cells vector are discarded.
- If split.cells does not have enough values for all columns, a warning is produced and default values are added for the remaining columns.
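These rules can be sketched in a few lines of base R. The function below is a hypothetical helper written for this post, not the code from the pull request; the name resolve_split_cells, its arguments and the default width of 30 are assumptions of the sketch.

```r
# Hypothetical helper (not pander's actual code): expand split.cells into
# one max width for the row names plus one max width per column.
resolve_split_cells <- function(split.cells, n_cols, has_rownames,
                                default = 30) {
  # A single number keeps the old behaviour: the same width everywhere.
  if (length(split.cells) == 1) {
    return(list(rownames = split.cells, columns = rep(split.cells, n_cols)))
  }

  row_width <- default
  # Unless there is exactly one value per column, the first value
  # belongs to the row names (when the table has any).
  if (has_rownames && length(split.cells) != n_cols) {
    row_width <- split.cells[1]
    split.cells <- split.cells[-1]
  }

  if (length(split.cells) > n_cols) {
    # Extra values are silently discarded.
    split.cells <- split.cells[seq_len(n_cols)]
  } else if (length(split.cells) < n_cols) {
    # Too few values: warn and fill up with the default.
    warning("split.cells is too short, using default width for the rest")
    split.cells <- c(split.cells, rep(default, n_cols - length(split.cells)))
  }

  list(rownames = row_width, columns = split.cells)
}

# One value for the row names, then one per column.
resolve_split_cells(c(5, 10, 20, 30), n_cols = 3, has_rownames = TRUE)
```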
Examples of all the cases can be found in the pull request.
2) Breaking lines in cells even when there is no whitespace, using hyphenation where needed - Pull request #93
Next, I added support for easier handling of multi-line cells, using hyphenation when needed. It was implemented using the hyphenation algorithms from the koRpus package, with some manipulation of text based on nchar.
Generally, it follows the same rules as the splitting of large cells; when the use.hyphening option is enabled, it will additionally try to split words across line breaks.
d> df <- data.frame(a="Machiavellianism", b="Hypervitaminosis")
d> pander(df, split.cells=10)
---------------------------------
       a                b
---------------- ----------------
Machiavellianism Hypervitaminosis
---------------------------------
d> pander(df, split.cells=10, use.hyphening = TRUE)
-------------------
    a         b
--------- ---------
 Machi-   Hypervit-
avellian- aminosis
   ism
-------------------
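The nchar-based part of this can be illustrated without koRpus. The sketch below is a deliberately naive stand-in: it cuts a word at fixed positions instead of at linguistically valid hyphenation points, so it will not reproduce the breaks shown above. The function name split_word is made up for the example.

```r
# Naive stand-in for hyphenation (not what koRpus does): cut an over-long
# word into chunks of at most `width` characters, ending every chunk but
# the last with "-".
split_word <- function(word, width) {
  stopifnot(width >= 2)
  pieces <- character(0)
  while (nchar(word) > width) {
    # Keep width - 1 characters and spend the last position on the hyphen.
    pieces <- c(pieces, paste0(substr(word, 1, width - 1), "-"))
    word <- substr(word, width, nchar(word))
  }
  c(pieces, word)
}

split_word("Machiavellianism", 10)
# returns c("Machiavel-", "lianism")
```

A real hyphenation algorithm only differs in where it is allowed to cut; the bookkeeping around the cell width stays the same.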
3) Allowing the relative width of columns to be specified - Pull request #96.
Lastly, I added support for supplying split.cells as a vector of percentage values. These are relative to the split.tables parameter, meaning that in this case split.tables becomes the max.width for the table. The implementation requires that a value is supplied for every column and for the row names, and that the relative widths sum to 100%. After the work done in the previous parts, this particular change was rather straightforward, since it was mostly about converting percentages to numbers and reusing the work done in Pull request #90.
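The conversion itself is simple arithmetic. Here is a minimal sketch of it, assuming the percentages arrive as strings such as "30%" and that the table width is 80 characters; the helper name relative_widths and both of those details are assumptions of this example rather than a description of the code in the pull request.

```r
# Hypothetical helper: turn percentage strings into absolute column widths.
relative_widths <- function(split.cells, split.tables = 80) {
  pct <- as.numeric(sub("%$", "", split.cells))
  # Every entry must be a percentage and together they must cover 100%.
  if (anyNA(pct) || !isTRUE(all.equal(sum(pct), 100))) {
    stop("relative widths must be percentages that sum to 100%")
  }
  # Multiply before dividing and round down, so the sum of the widths
  # can never exceed the table width.
  floor(pct * split.tables / 100)
}

relative_widths(c("20%", "30%", "50%"))
# returns c(16, 24, 40)
```

The resulting numeric vector can then be treated exactly like an absolute split.cells vector.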
Rewriting core functionality with Rcpp
In summer 2014 I attended useR 2014 in LA and went to 2 workshops on Rcpp. I was really impressed with the power that Rcpp gives and was eager to use it. Since I was planning to work on improving the performance of pandoc.table and had some experience with C++, that also seemed to be a great fit.
I made the required changes in 2 parts:
1) I implemented an Rcpp version of splitLine in PR #101. splitLine was a helper function that broke long lines into multiple lines to allow multi-line cells and help enforce the predefined cell width. The code for the Rcpp version looks more or less similar to the one in R, just a bit more modular. The hardest thing about writing things in Rcpp is debugging, since as far as I know there is no "IDE-like" experience for that, which leaves either printing or using gdb/lldb.
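For readers who have not seen this kind of helper, here is a rough R sketch of the idea: pack whitespace-separated words greedily into lines of a given width. It is not the actual splitLine from pander (neither the R nor the Rcpp version); the name wrap_line and the choice to join the lines with "\n" are assumptions of the sketch.

```r
# Hypothetical sketch of a line splitter: greedily pack words into lines
# that are at most max_width characters wide.
wrap_line <- function(x, max_width) {
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  lines <- character(0)
  current <- ""
  for (w in words) {
    candidate <- if (nzchar(current)) paste(current, w) else w
    if (!nzchar(current) || nchar(candidate) <= max_width) {
      # The word still fits (or it starts a line and has to go somewhere).
      current <- candidate
    } else {
      # The line is full: store it and start a new one with this word.
      lines <- c(lines, current)
      current <- w
    }
  }
  if (nzchar(current)) lines <- c(lines, current)
  paste(lines, collapse = "\n")
}

wrap_line("foo bar bazz", 7)
# returns "foo bar\nbazz"
```

Words longer than max_width are left intact here; that is the case the hyphenation option described earlier is meant to handle.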
2) After that, I took on implementing table.expand in PR #105. I focused on table.expand because, despite its rather small size, it does all the work of actually printing the table (everything else in pandoc.table is argument processing, sanity checks and preparation). Gergely originally wrote table.expand in pretty idiomatic R, heavily using functions like sapply/lapply, which made the task of translating it into Rcpp a bit harder and thus more interesting. For example, I had to write a simplified equivalent of format, since I couldn't find any C++ equivalent that supported justification.
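To give an idea of what "a simplified equivalent of format" has to do, here is the core of it written out by hand in R: pad a string to a fixed width with a given justification. This is an illustrative sketch with a made-up name (pad_cell), not a translation of the C++ code from the pull request.

```r
# Hypothetical sketch: pad strings to a fixed width, the way
# format(x, width = , justify = ) does for character input.
pad_cell <- function(x, width, justify = c("left", "right", "centre")) {
  justify <- match.arg(justify)
  # Number of spaces to distribute; strings that are too long get none.
  gap <- pmax(width - nchar(x), 0)
  left <- switch(justify,
                 left = 0 * gap,
                 right = gap,
                 centre = gap %/% 2)
  paste0(strrep(" ", left), x, strrep(" ", gap - left))
}

pad_cell("ism", 9, "centre")
# returns "   ism   "
```

In R all of this is a single call to format; in C++ the same logic has to be spelled out for every cell.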
The performance improvement for table.expand was great (the minimum I saw was 10x, with an average of 15x):
> b1
Unit: microseconds
            expr     min      lq   median      uq      max neval
    table.expand 711.426 743.061 810.3655 863.430 2219.897   100
 tableExpand_cpp  57.685  64.676  84.1530 100.071  476.688   100
> b2
Unit: microseconds
            expr      min        lq    median       uq      max neval
    table.expand 1929.499 1989.7145 2066.2760 2599.152 4317.919   100
 tableExpand_cpp   58.566   65.9905   85.2165   96.538  167.777   100
> b3
Unit: microseconds
            expr      min       lq    median       uq      max neval
    table.expand 1366.966 1405.690 1455.6460 1502.291 3627.009   100
 tableExpand_cpp   55.687   67.638   87.1345   96.651  190.005   100
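Tables of this shape are what the microbenchmark package prints. The sketch below shows the general form of such a comparison; the two padding functions are toy stand-ins invented for the example, not table.expand or its Rcpp port, and the inputs used for b1, b2 and b3 are not reproduced here.

```r
library(microbenchmark)

# Two toy implementations of the same task (illustration only):
# left-pad-free, left-justified cells of a fixed width.
pad_loop <- function(x, width) {
  sapply(x, function(s) formatC(s, width = width, flag = "-"),
         USE.NAMES = FALSE)
}
pad_vec <- function(x, width) {
  formatC(x, width = width, flag = "-")
}

cells <- rep(c("a", "bb", "ccc"), 100)

# Check that both versions agree before timing them.
stopifnot(identical(pad_loop(cells, 5), pad_vec(cells, 5)))

# Run each expression 100 times and summarise the timings.
b <- microbenchmark(pad_loop(cells, 5), pad_vec(cells, 5), times = 100)
print(b)
```

Checking that the two versions return identical results first is worth the extra line, since a faster function that produces a different table is not an improvement.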
But I wasn't that satisfied with the 1.5 - 5x speed-up achieved for pandoc.table.return. For example:
- Small data.frame
> b.old.simple
Unit: milliseconds
expr min lq median uq max neval
pander(data.frame(a = "foo\\nbar"), keep.line.breaks = TRUE) 5.501682 6.224382 7.290573 8.382554 12.23485 100
> b.new.simple
Unit: milliseconds
expr min lq median uq max neval
pander(data.frame(a = "foo\\nbar"), keep.line.breaks = TRUE) 2.30613 2.587211 2.924714 3.806796 5.605237 100
- mtcars
> b.old.mtcars
Unit: milliseconds
expr min lq median uq max neval
pander(mtcars, style = "simple") 192.7679 201.4737 206.0359 210.8648 282.0304 100
> b.new.mtcars
Unit: milliseconds
expr min lq median uq max neval
pander(mtcars, style = "simple") 66.4057 73.86317 76.89688 83.96956 119.5512 100
- volcano
> b.old.volcano
Unit: seconds
expr min lq median uq max neval
pander(volcano) 4.408776 4.415709 4.479123 4.513334 4.536083 10
> b.new.volcano
Unit: seconds
expr min lq median uq max neval
pander(volcano) 3.502687 3.545587 3.583752 3.634691 3.850749 10
I think that the main cause of the smaller speed-up is the recursive call in pandoc.table.return when splitting a large table.
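The following is not pander's code, only a rough sketch of the kind of recursion meant here, under the assumption that a table too wide for the page is cut after the last column that fits and the remainder goes through the same function again. The name split_columns and the one-space column separator are choices made for the example.

```r
# Hypothetical helper: group column indices into chunks whose combined
# width (columns plus one separating space each) stays within max_width.
split_columns <- function(widths, max_width) {
  # Width of the table if it ended after each column.
  running <- cumsum(widths + 1L) - 1L
  fit <- sum(running <= max_width)
  if (length(widths) <= 1 || fit >= length(widths)) {
    return(list(seq_along(widths)))  # everything fits: stop recursing
  }
  fit <- max(fit, 1L)                # always take at least one column
  rest <- split_columns(widths[-seq_len(fit)], max_width)
  # Shift the indices of the remaining chunks back to original positions.
  c(list(seq_len(fit)), lapply(rest, function(i) i + fit))
}

split_columns(c(10L, 10L, 10L, 10L), max_width = 25L)
# returns list(1:2, 3:4)
```

Each chunk found this way has to be rendered as a table of its own, so for a wide input like volcano the surrounding R code runs many times, which would limit how much a faster table.expand alone can help.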
Aftermath
Some statistics of my work during this awesome experience:
- 127 commits
- 40 accepted pull requests
- 14 new pander methods
- 4026 lines inserted, 1459 lines deleted
- 4 bugs fixed
But those numbers don't reflect the more important change that I think GSoC is about. I was able to participate in something that is widely used and to contribute to the R community while learning a lot about R package development (for example documenting with roxygen, testing with testthat and CI with Travis). All in all, it was such a great experience that I was happy to return for GSoC 2015...