Data analysis workflow

Motivation In data analysis, a REPL (read-evaluate-print-loop) is great for asking questions quickly about a dataset. As opposed to writing code using a compiler, an interactive environment lets us hone in on the right questions, which is of great importance for exploratory work. A typical REPL session — e.g., using R, Pig, Hive, Python — involves loading the dataset and performing a combination of filtering, projection, and transformation. Finding the appropriate combination may require many iterations, such as tweaking the various parameters used in filtering. The process of repeating almost the same commands can be quite tedious, especially dealing with wraparound lines. Some potential workflows are: 1) Write script in a text […]

More

SVD and low-rank approximation

The singular value decomposition (SVD) factorizes a matrix where are unitary and is diagonal with nonnegative entries. Unlike the eigenvalue decomposition, every (real or complex) matrix is guaranteed to have a SVD. Geometrically, the SVD implies that every linear transformation is equivalent to a rotation, a scaling, then another rotation. In data analysis, the SVD can be computed to perform principal component analysis (PCA), where an ellipsoid is used to “fit” a data matrix. In other words, once the SVD is computed, a partial SVD, can be used to approximate the matrix, where and are the left and right singular vectors, and are the singular values in decreasing order. Equivalently, a partial […]

More