Roman Tsegelskyi's blog    About    Atom    RSS

Google Summer of Code and Pander - Part I

For the last 2 years, I have been lucky enough to be selected and participate in Google Summer of Code with R Project, working on different aspects of pander package. I felt that some kind of review of my experience and work done is getting a bit overdue, so I decided to do a 2-part write up about original plans, achieved results and gained experiences participating in GSoC. I hope that those 2 articles serve as a reasonable example to people who are considering participating as to one of the possible projects to participate in GSoC possible.

What is pander?

pander is an R package mainly developed and maintained by Gergely Daroczi that provides helper functions and a generic S3 method to render various types of R objects into pandoc's markdown format that can be converted further to HTML, PDF, docx, odt and other document formats that pandoc supports. A quick example of power of pander:

> data(warpbreaks, package = "datasets")
> ct <- CrossTable(warpbreaks$wool, warpbreaks$tension,
+                  dnn = c("Wool", "Tension"))

> pander(ct, plain.ascii = TRUE)

------------------------------------------------------------
               Tension                                      
    Wool          L           M           H         Total   
------------ ----------- ----------- ----------- -----------
     A                                                      
     N           9           9           9          27      
Chi-square       0           0           0                  
  Row(%)     33.3333%    33.3333%    33.3333%    50.0000%   
 Column(%)      50%         50%         50%                 
  Total(%)    16.6667%    16.6667%    16.6667%              

     B                                                      
     N           9           9           9          27      
Chi-square       0           0           0                  
  Row(%)     33.3333%    33.3333%    33.3333%    50.0000%   
 Column(%)      50%         50%         50%                 
  Total(%)    16.6667%    16.6667%    16.6667%              

   Total         18          18          18          54     
              33.3333%    33.3333%    33.3333%              
------------------------------------------------------------

The package can be used as a standalone tool for literate programming building on the traditions of brew, or can be also used inside of e.g. knitr or other markdown-based tools to render nice tables and other textual forms. It provides a wide variety of different options for rendering markdown and supports 67 different R classes.

Why pander?

Prior to GSoC 2014 I was had some reasonable exposure to R programming. During my undergraduate at Kyiv Polytechnic Institute, I took some online courses about data science that used R and even did a project of classifying spam using Naive Bayes and SVM (which was rather easy task to do in R). Moreover, when I started my graduate studies at Purdue University, I got involved with S3 lab and fastR. So taking all this into consideration, I was looking for something to (a) use my previous experience with R and (b) get to use R from different dimension, so GSoC and pander looked like a great fit.

My plan for GSoC 2014

While I worked on many small things here and there, and besides documentation and testing, my main goals stated in the proposal for the summer were:

My full application is available here.

Adding new S3 methods

First, I have started with extending generic S3 method. Here is the list of methods that I have implemented:

This was a good way for me to get up to speed with pander. Typically, pander.* methods were based on respective print.* methods with some modifications for before calling pandoc.table. However, some of the methods, for example pander.CrossTable and pander.mtable required a complete rewrite to naturally fit with how table can be represented with markdown. Generally, most complications would come would come from lack of support of row- and col-spanning cells in markdown or multi-line cells.

Adding configurable column width

After getting more familiar with pander, my next step was to implement configurable width support. Prior to that pander supported single value for all the columns, which was a bit limiting in some cases. Roughly the work was divided into 3 parts:

1) Allowing to specify max width of the columns separately - Pull request #90.

The implementation was based around such principles:

  • If split.cells is just a number, behavior is the same as before.
  • First value in split.cells goes for row names, if there are no row names, then it goes for first param.
  • If length(split.cells) == number of columns, then there is one max.width for every column and for rownames (if they exists) default value will be used
  • If length(split.cells) == number of columns + 1, then first value in split.cells if for row names, and others are for columns.
  • All extra value in split.cells vector are discarded.
  • If split.cells does not have enough values for all columns, warning is produced and default values are added for other columns.

The examples of all the cases can be found in the pull request.

2) Break lines in cells, even in case when there are no white spaces, use hyphenation where needed - Pull request #93

Next, I have added support for easier handling of multi-line cells using hyphening when needed. It was implemented by using hyphenation algorithms from koRpus package with some manipulation of text based on nchar. Generally it follows all the same rules as split of large cells, just when use.hyphening option is enabled, it will try to split.words using line breaks also.

d> df <- data.frame(a="Machiavellianism", b="Hypervitaminosis")
d> pander(df, split.cells=10)

---------------------------------
       a                b        
---------------- ----------------
Machiavellianism Hypervitaminosis
---------------------------------
d> pander(df, split.cells=10, use.hyphening = TRUE)

-------------------
    a         b    
--------- ---------
 Machi-   Hypervit-
avellian- aminosis 
   ism             
-------------------

3) Allowing to specify relative width of columns - Pull request #96.

Lastly, I have added support of supplying split.cells as a vector with percentage values. It is relative to split.tables parameter, meaning that in that case split.tables becomes a max.width for a table. The implementation imposes a requirement that value is supplied for every column and rownames and that relative width sums to 100%. After the work done in previous parts, this particular change was rather straightforward, since it was mostly about converting percentages to numbers and using work done in Pull request #90.

Rewriting core functionality with Rcpp

In summer 2014 I attended useR 2014 in LA and went to 2 workshops on Rcpp. I got really impressed with the power that Rcpp gives and was eager to use it. Since I was planning to work on improving performance of pandoc.table and had some experience with C++ that also seemed to be a great fit.

I did the changes required in 2 parts:

1) I implemented Rcpp version of splitLine in PR #101. splitLine was a helper function that broke the long lines into multiple lines to allow multi-line cells and help enforce predefined cell width. Code for Rcpp version looks more or less similar to the one in R, just a bit more modular. Most hard thing about writing things in Rcpp is debugging, since as far as I know there is no "IDE-like" experience for that, which leaves either printing or using gdb/lldb.

2) After that, I took on the implementing table.expand in PR #105. I focused on table.expand, because despite it's rather small size, it does all the work for actually printing the table (everything else in pandoc.table are arguments processing, sanity checks and preparation). Since originally Gergely wrote table.expand in pretty idiomatic R heavily using vector functions like sapply/lapply, which made the task for translating it into Rcpp a bit harder and thus more interesting. For example, I had to write a simplified equivalent of format, since I couldn't any C++ equivalent that supported justification.

The performance improvement for table.expand was great (the minimal I saw what 10x, with average of 15x):

> b1
Unit: microseconds
                   expr      min      lq   median      uq      max neval
           table.expand  711.426 743.061 810.3655 863.430 2219.897   100
        tableExpand_cpp   57.685  64.676  84.1530 100.071  476.688   100
> b2
Unit: microseconds
                   expr        min        lq    median       uq      max neval
           table.expand   1929.499 1989.7145 2066.2760 2599.152 4317.919   100
        tableExpand_cpp     58.566   65.9905   85.2165   96.538  167.777   100
> b3
Unit: microseconds
                   expr      min       lq    median       uq      max neval
           table.expand 1366.966 1405.690 1455.6460 1502.291 3627.009   100
        tableExpand_cpp   55.687   67.638   87.1345   96.651  190.005   100

But I wasn't that much satisfied with 1.5 - 5x speed up achieved for pandoc.table.return. For example:

  • Small data.frame
> b.old.simple
Unit: milliseconds
                                                         expr      min       lq   median       uq      max neval
 pander(data.frame(a = "foo\\nbar"), keep.line.breaks = TRUE) 5.501682 6.224382 7.290573 8.382554 12.23485   100
> b.new.simple
Unit: milliseconds
                                                         expr     min       lq   median       uq      max neval
 pander(data.frame(a = "foo\\nbar"), keep.line.breaks = TRUE) 2.30613 2.587211 2.924714 3.806796 5.605237   100
  • mtcars
> b.old.mtcars
Unit: milliseconds
                             expr      min       lq   median       uq      max neval
 pander(mtcars, style = "simple") 192.7679 201.4737 206.0359 210.8648 282.0304   100
> b.new.mtcars
Unit: milliseconds
                             expr     min       lq   median       uq      max neval
 pander(mtcars, style = "simple") 66.4057 73.86317 76.89688 83.96956 119.5512   100
  • volcano
> b.old.volcano
Unit: seconds
            expr      min       lq   median       uq      max neval
 pander(volcano) 4.408776 4.415709 4.479123 4.513334 4.536083    10
> b.new.volcano
Unit: seconds
            expr      min       lq   median       uq      max neval
 pander(volcano) 3.502687 3.545587 3.583752 3.634691 3.850749    10

I think that the main causes for the less speed-up is the recursive call in pandoc.table.return when splitting large table.

Aftermath

Some statistics of my work during this awesome experience:

  • 127 commits
  • 40 accepted pull requests
  • 14 new pander methods
  • 4026 lines inserted, 1459 lines deleted
  • 4 bugs fixed

But those numbers don't reflect more important change that I think GSoC is about. I was able to participate in something that is largely used and was able to contribute to R community while learning a lot about R package development (for example documenting with roxygen, testing with testthat and CI with Travis). All in all, it was such a great experience that I was happy to return for GSoC 2015...

comments powered by Disqus