Data analysis workflow

Motivation

In data analysis, a REPL (read-evaluate-print-loop) is great for asking questions quickly about a dataset. As opposed to writing code using a compiler, an interactive environment lets us hone in on the right questions, which is of great importance for exploratory work.

A typical REPL session — e.g., using R, Pig, Hive, Python — involves loading the dataset and performing a combination of filtering, projection, and transformation. Finding the appropriate combination may require many iterations, such as tweaking the various parameters used in filtering. The process of repeating almost the same commands can be quite tedious, especially dealing with wraparound lines.

Some potential workflows are:
1) Write script in a text editor. Execute it on the command line.
During development, lines are typically appended to an existing script. So each script execution has a significant overlap with previous executions. This can be a bottleneck when loading the input or a particular transformation takes time to process. Executing a script also makes it harder to try out various transformation on the fly.

2) Write script in a text editor. Then copy and paste commands as needed into a REPL.
While this allows many iteration, the physical act of copy-and-paste becomes the most dominant part of script writing, overshadowing the thought process. Furthermore, copy-and-paste from one window to another is somewhat challenging in a ssh session when working remotely.

3) Write script in a text editor. Execute the current line of the cursor in a REPL in a separate window.
This workflow allows the flexibility of experimental work. Certain IDEs, such as RStudio, already has this functionality. However, if one is looking for a uniform experience across various languages, a simple shell experience can be achieved with the vim+tmux+slime combination, described next.

vim + tmux + slime

The basic idea is to split a shell into two panes divided by a horizontal line, where vim is in the top pane and a REPL session at the bottom. A keyboard shortcut then sends the current line under the cursor in vim to the REPL. (Selecting a block of lines to be evaluated is also an option.)

1) tmux
tmux has many features. Here we just focus on our particular use case. Once installed, create a new session and split the current window into two panes:

tmux new -s session-name
tmux split-window

2) vim-slime
Originally an Emacs plugin, vim-slime is the tool that sends a block of lines in vim directly to another tmux pane. Once installed, adding these lines to .vimrc enables this powerful trick:

let mapleader="\"
map t :SlimuxREPLSendLine
vmap t :SlimuxREPLSendSelection

In this example, in vim’s normal mode, the keystroke “space” followed by “t” sends either the current line or the highlighted block of lines, if it exists, to another pane. (For the first usage, there will be a prompt that asks which pane should receive the signal from vim.)

Lastly, the vim-tmux-navigator plugin allows seamless navigation among vim split windows and tmux panes.

Here is an animated screenshot that sends three lines in vim to be evaluated in R:

data-analysis-workflow

Leave a Reply

Your email address will not be published.