The Basel problem via Fourier series

The Basel problem asks for the sum of the reciprocals of the squares of the natural numbers, and it is known that

    \[\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}.\]

While there are numerous known proofs (our presentation resembles Proof 9 most closely), we give a self-contained one that motivates the use of Fourier analysis for periodic functions.

Recall that a function f:\mathbb{R}\rightarrow\mathbb{C} is 2\pi-periodic if for every x \in \mathbb{R}, f(x+2\pi)=f(x). From Fourier analysis, every such function can be expressed as a series of cosines and sines. More precisely,

If f is continuous and 2\pi-periodic, then

    \[\left\Vert f - \sum_{n=-\infty}^{\infty} \hat{f}(n) e_n\right\Vert_2 = 0,\]

where e_n(x) = e^{inx}, and \hat{f}(n) is the n-th Fourier coefficient equal to \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx} dx.

We say that \sum_{n=-\infty}^{\infty} \hat{f}(n) e_n is the Fourier series of f. When we define the inner product of two functions to be \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x) \overline{g(x)} dx, it is easily seen that the “characters” \{e_n\} form an orthonormal system, and the n-th Fourier coefficient can be viewed as the projection of f along the n-th character. The theorem above asserts that f can be approximated in the L_2-norm by its Fourier series. In fact, by applying the Weierstrass M-test, we can make the following statement with a stronger convergence guarantee:

If f is continuous and 2\pi-periodic and \sum_{n=-\infty}^{\infty} \left|\hat{f}(n)\right| is convergent, then its Fourier series \sum_{n=-\infty}^{\infty} \hat{f}(n) e_n converges uniformly to f.

Furthermore, the Fourier transform \hat{f} itself encodes many of the properties of the original function f. In particular, we note that the symmetry of f is preserved in its Fourier transform.

If f is even, then for every n \in \mathbb{Z}, \hat{f}(-n) = \hat{f}(n).
If f is odd, then for every n \in \mathbb{Z}, \hat{f}(-n) = -\hat{f}(n).

Now we have the machinery to compute \sum_{n=1}^{\infty} 1/n^2.

(Of Basel) Define f:[-\pi, \pi]\rightarrow \mathbb{R} by f(x)=|x|, extended 2\pi-periodically to all of \mathbb{R} (the extension is continuous since f(-\pi)=f(\pi)). We compute the Fourier coefficients of f. First note that

    \[\hat{f}(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} |x| dx = \frac{\pi}{2}.\]

For nonzero n \in \mathbb{Z}, splitting the integral of |x|e^{-inx} at 0 and reversing the limits of integration on \int_{-\pi}^{0}(-x)e^{-inx}dx, we have

    \[2\pi\hat{f}(n) = \int_{0}^{-\pi}xe^{-inx} dx + \int_{0}^{\pi}xe^{-inx} dx. \]

Using integration by parts, we have for every a, b \in \mathbb{R},

    \begin{eqnarray*} \int_a^b xe^{-inx}dx & = & \frac{x e^{-inx}}{-in}\Biggr|_a^b - \int_a^b \frac{e^{-inx}}{-in}dx \\ & = & \frac{x e^{-inx}}{-in}\Biggr|_a^b + \frac{e^{-inx}}{n^2}\Biggr|_a^b. \end{eqnarray*}


Substituting (a,b)=(0,-\pi) and (a,b)=(0,\pi), and using e^{\pm in\pi}=(-1)^n, we obtain

    \begin{eqnarray*} 2\pi\hat{f}(n) & = & \frac{-\pi(-1)^n}{-in} + \frac{(-1)^n -1 }{n^2} + \frac{\pi(-1)^n}{-in} + \frac{(-1)^n -1 }{n^2} \\ & = & \frac{2((-1)^n - 1)}{n^2}.  \end{eqnarray*}


Hence,

    \[ \hat{f}(n) =  \begin{cases} \frac{-2}{\pi n^2} & \text{if $n$ is odd,} \\ 0 & \text{if $n$ is even and nonzero,} \\ \frac{\pi}{2} & \text{if $n=0$.} \end{cases} \]
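As a quick numerical sanity check of these coefficients (this snippet is our own and not part of the proof; the helper name `fourier_coeff` and the plain trapezoidal rule are our choices), we can approximate \frac{1}{2\pi}\int_{-\pi}^{\pi} |x|e^{-inx}dx directly:

```python
import cmath
import math

def fourier_coeff(n, points=20000):
    """Approximate (1/2pi) * integral of |x| e^{-inx} over [-pi, pi]
    using the composite trapezoidal rule on `points` subintervals."""
    a, b = -math.pi, math.pi
    h = (b - a) / points
    # Endpoint terms get weight 1/2 in the trapezoidal rule.
    total = 0.5 * (abs(a) * cmath.exp(-1j * n * a) + abs(b) * cmath.exp(-1j * n * b))
    for k in range(1, points):
        x = a + k * h
        total += abs(x) * cmath.exp(-1j * n * x)
    return total * h / (2 * math.pi)

# Expected from the closed form: odd n gives -2/(pi n^2),
# even nonzero n gives 0, and n = 0 gives pi/2.
```

Since the kink of |x| at 0 lands on a grid point, the trapezoidal approximation agrees with the closed form to high accuracy.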

From Euler’s formula, the Fourier series of f can be written as

    \[  \hat{f}(0) + \sum_{n=1}^{\infty} \left[ (\hat{f}(n) + \hat{f}(-n)) \cos(nx) + i(\hat{f}(n) - \hat{f}(-n)) \sin(nx) \right]. \]

Since f is even, \hat{f}(-n)=\hat{f}(n) by the proposition above. From our calculation, the Fourier series of f is explicitly

    \[  \frac{\pi}{2} + \sum_{n\geq1, n \text{ odd}} \frac{-4}{\pi n^2} \cos(nx). \]

Note that \sum_{n=1}^{\infty} 1/n^2 converges (e.g., by the Cauchy condensation test), so \sum_{n=-\infty}^{\infty} |\hat{f}(n)| converges as well. Thus, by the convergence theorem above, we conclude that

    \[ \frac{\pi}{2} + \sum_{n\geq1, n \text{ odd}} \frac{-4}{\pi n^2} \cos(nx) \longrightarrow |x|, \]

where the convergence is uniform (and in particular, pointwise). Setting x=0, we see that

    \[ \sum_{n\geq1, n \text{ odd}} \frac{1}{n^2} = \frac{\pi^2}{8}. \]

Lastly, observe that

    \begin{eqnarray*} \sum_{n=1}^{\infty} \frac{1}{n^2} & = & \sum_{n\geq1, n \text{ even}} \frac{1}{n^2} + \sum_{n\geq1, n \text{ odd}} \frac{1}{n^2} \\ & = & \frac{1}{4} \sum_{n=1}^{\infty} \frac{1}{n^2} + \frac{\pi^2}{8}, \end{eqnarray*}

so solving for the sum gives \sum_{n=1}^{\infty} 1/n^2 = \frac{4}{3}\cdot\frac{\pi^2}{8} = \frac{\pi^2}{6}, finishing the proof.
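As a numerical sanity check of the two identities (again our own snippet, not part of the proof):

```python
import math

N = 10**6  # number of terms in the partial sums

# Full sum and the sum over odd indices only.
total = sum(1.0 / n**2 for n in range(1, N + 1))
odd_total = sum(1.0 / n**2 for n in range(1, N + 1, 2))

# The tails are of order 1/N, so the partial sums should sit
# within about 1e-6 of pi^2/6 and pi^2/8 respectively.
```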

Auction algorithm for the assignment problem

Maximum-weight bipartite matching is a well-known problem in graph theory, operations research, and economics. Also known as the assignment problem, it models a marketplace with buyers and items, where every buyer has a valuation for each item, and we want to match (assign) each buyer to an item, with no item being shared. For social utility, an assignment that maximizes the total valuation is ideal.

In graph theory, the Hungarian algorithm by Kuhn produces a matching in polynomial time maximizing the total weight of the edges. Independently, Bertsekas from operations research and Demange, Gale, and Sotomayor from the economics perspective both use an underlying auction to solve the same problem. The auction provides additional intuition for this problem not transparent when using only graph-theoretic techniques. Here we describe a simple auction algorithm, largely following the presentation from this blog post.

To formulate this problem, we consider a bipartite graph with parts B and I, where B represents buyers and I represents items, and each edge (i,j) has a nonnegative weight v_{ij} representing the value Buyer i places on Item j.

A matching from B to I is a function M:B \rightarrow I\cup\{\emptyset\} such that for all i\neq j\in B, either M(i)\neq M(j) or M(i)=M(j)=\emptyset.

The assignment problem is simply to find a matching function M such that its valuation \sum_{i\in B} v_{i,M(i)} is maximized. (Since it is possible that some buyers may not be assigned an item, we write v_{i,\emptyset}=0 for notational convenience.)

Below is the pseudocode of the auction algorithm. Intuitively, the algorithm simulates a multi-item auction, where we put a price tag initialized to zero on each item. The queue Q represents the list of unmatched buyers, which is initialized to B. In each iteration, a buyer is greedily matched with the item maximizing his valuation minus the current price. Each time an item is matched, however, it becomes slightly more expensive, making it less valuable to other buyers in future rounds.

  Queue Q <- B
  for j in I:
    p[j] <- 0
  BidirectionalMap M
  while !Q.isEmpty():
    i <- Q.pop()
    j <- Find j in I maximizing v[i][j] - p[j]

    if v[i][j] - p[j] >= 0:
      if M.ContainsValue(j):
        Q.push(M.KeyOf(j))    // the outbid previous holder re-enters the queue
      M(i) <- j
      p[j] <- p[j] + epsilon

  Output M, p
Consider the case when |I|=1, where a group of buyers competes for a single item. Clearly whoever values the item the most will win the auction, but at what price? From the check v[i][j] - p[j] >= 0 in the above pseudocode, it is readily seen that a buyer drops out once the price exceeds his value. Thus, the final sale price is at most the second-highest valuation among all buyers, plus O(\epsilon). Hence, the above algorithm can be seen as a generalization of the second-price auction to multiple items.
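To make the auction concrete, here is a minimal runnable Python sketch of AUCTION(\epsilon); the function name and the two dictionaries standing in for the bidirectional map are our own choices, not from the original papers:

```python
def auction(values, epsilon):
    """Simulate the epsilon-auction.

    values[i][j] is the nonnegative value Buyer i places on Item j.
    Returns a buyer-to-item matching and the final price vector.
    """
    n_buyers, n_items = len(values), len(values[0])
    prices = [0.0] * n_items
    match = {}   # buyer -> item
    owner = {}   # item -> buyer (the inverse direction of the map)
    queue = list(range(n_buyers))
    while queue:
        i = queue.pop()
        # Greedily pick the item maximizing value minus current price.
        j = max(range(n_items), key=lambda jj: values[i][jj] - prices[jj])
        if values[i][j] - prices[j] >= 0:
            if j in owner:                 # outbid buyer re-enters the queue
                queue.append(owner[j])
                del match[owner[j]]
            match[i] = j
            owner[j] = i
            prices[j] += epsilon
    return match, prices
```

On a single item with valuations 10, 7, 3, the highest bidder wins at a price near the second-highest valuation, matching the second-price intuition above.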

Complexity. Before analyzing the running time, we first observe that the algorithm terminates after finitely many steps. The key insight is that a buyer never overpays for an item. In particular, writing C = \max_{i,j} v_{i,j}, we have

AUCTION(\epsilon) terminates within (C/\epsilon + 1) |I| iterations.

To see this, fix an item j. Before each increment to p_j, the inequality p_j \leq v_{ij} \leq C must hold for the bidding buyer i, so the final price satisfies p_j \leq C + \epsilon, and each price is incremented at most C/\epsilon + 1 times. Since every iteration of the loop either increments some price or permanently removes a priced-out buyer from the queue, the number of iterations is at most (C/\epsilon + 1) |I|, up to |B| additional iterations for the drop-outs.

A trivial implementation that scans all items for the maximum takes linear time per iteration, leading to a running time of O(C|I|^2/\epsilon). A faster algorithm maintains |B| priority queues of size |I|, one per buyer, as follows: each time the price of Item j is incremented, d_j priority queues are updated, where d_j is the number of buyers connected to Item j in the bipartite graph. Then the total number of priority queue updates is

    \[ \sum_{j \in I} d_j \cdot (\text{\# of increments to } p_j) \leq \sum_{j \in I} d_j (C / \epsilon + 1). \]

By the handshaking lemma, \sum_{j \in I} d_j = 2m, where m is the number of edges in the bipartite graph. Thus, a priority queue-based implementation takes time at most O(Cm\log|I|/\epsilon).
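A sketch of the priority-queue idea in Python follows. Since heapq has no decrease-key operation, this version (our own variant, not the eager d_j-update scheme described above) repairs stale heap entries lazily when a buyer consults his heap: prices only increase, so every stored key is an upper bound on the item's true net value, and the top entry is the true maximum once it is up to date.

```python
import heapq

def auction_pq(values, epsilon):
    """Auction variant where each buyer keeps a max-heap of items keyed by
    net value (value minus price), with lazy repair of stale entries."""
    n_buyers, n_items = len(values), len(values[0])
    prices = [0.0] * n_items
    # Heap entries are (-(value - price at push time), item).
    heaps = [[(-values[i][j], j) for j in range(n_items)] for i in range(n_buyers)]
    for h in heaps:
        heapq.heapify(h)
    match, owner = {}, {}
    queue = list(range(n_buyers))
    while queue:
        i = queue.pop()
        h = heaps[i]
        while True:
            key, j = h[0]
            current = values[i][j] - prices[j]
            if -key <= current:      # entry is up to date: true argmax found
                break
            heapq.heapreplace(h, (-current, j))   # refresh the stale entry
        if current >= 0:
            if j in owner:                        # outbid buyer re-enters
                queue.append(owner[j])
                del match[owner[j]]
            match[i] = j
            owner[j] = i
            prices[j] += epsilon                  # i's own entry goes stale
    return match, prices
```

This trades the eager per-increment updates for amortized reinsertions; an indexed heap with decrease-key would implement the scheme in the text directly.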

Now we analyze the valuation of the matching M produced by the algorithm. The proof is short and self-contained but utilizes \epsilon-complementary slackness, a concept related to the primal-dual method in optimization.

Let M be the matching produced by AUCTION(\epsilon) and M' be any matching from B to I. Then

    \[\sum_{i\in B} v_{iM(i)} \geq \sum_{i\in B} v_{iM'(i)} - \epsilon |B|. \]

Since we can choose \epsilon to be arbitrarily small, the theorem implies that the output matching is arbitrarily close to optimal. In particular, if all the values v_{ij} are binary, setting \epsilon < 1/|B| implies that the algorithm terminates with a maximum matching, and its time complexity is cubic on dense graphs, on par with the Hungarian algorithm.

We say that Buyer i is \epsilon-happy under M if he is matched with an item that is \epsilon-close to optimal, i.e., for all j\in I, v_{iM(i)} - p_{M(i)} \geq v_{ij} - p_j - \epsilon. For notational convenience, we will write v_{i\emptyset}=p_{\emptyset}=0, so a buyer being completely out-priced (being matched with an “empty” item) is also \epsilon-happy.

The key insight is that the algorithm maintains the loop invariant that all buyers in B \setminus Q are \epsilon-happy. The invariant holds vacuously at the start. In each iteration, the popped buyer is matched with an item maximizing v_{ij} - p_j, and the subsequent price increment lowers his net value by only \epsilon, so he is \epsilon-happy after the update; the previous holder of the item, now unmatched, is added back to the queue. Since all buyers in B are \epsilon-happy when the algorithm terminates, we can observe that for any matching M',

    \[ \sum_{i\in B} \left(v_{iM(i)} - p_{M(i)}\right) \geq \sum_{i\in B} \left(v_{iM'(i)} - p_{M'(i)} - \epsilon\right). \]

Re-arranging, it is easy to see that

    \[ \sum_{i\in B} v_{iM(i)} \geq \sum_{i\in B} v_{iM'(i)} - \epsilon|B| + \sum_{i\in B}(p_{M(i)} - p_{M'(i)}). \]

Thus, it suffices to show that \sum_{i\in B}(p_{M(i)} - p_{M'(i)}) \geq 0. If Item j is in the range of M' but not M, then the algorithm must have never incremented its price, so p_j=0. Thus,

    \[ \sum_{i\in B}(p_{M(i)} - p_{M'(i)}) = \sum_{j \in Range(M)\setminus Range(M')} p_j, \]

which is nonnegative.

The theorem also implies that in any marketplace, there is a set of market-clearing prices, i.e., prices under which each buyer can be assigned an item maximizing his net value. If all the valuations are integers, then a variant of the above algorithm from the DGS paper produces a price vector that is pointwise minimal among all market-clearing prices.

Data analysis workflow


In data analysis, a REPL (read-evaluate-print-loop) is great for quickly asking questions about a dataset. As opposed to an edit-compile-run cycle, an interactive environment lets us home in on the right questions, which is of great importance for exploratory work.

A typical REPL session — e.g., using R, Pig, Hive, or Python — involves loading the dataset and performing a combination of filtering, projection, and transformation. Finding the appropriate combination may require many iterations, such as tweaking the various parameters used in filtering. Repeating almost the same commands can be quite tedious, especially when dealing with long, wrapped lines.

Some potential workflows are:
1) Write the script in a text editor. Execute it on the command line.
During development, lines are typically appended to an existing script, so each execution overlaps significantly with the previous ones. This becomes a bottleneck when loading the input or applying a particular transformation takes a long time. Executing a whole script also makes it harder to try out various transformations on the fly.

2) Write the script in a text editor. Copy and paste commands as needed into a REPL.
While this allows many iterations, the physical act of copy-and-paste becomes the dominant part of script writing, overshadowing the thought process. Furthermore, copy-and-paste between windows is somewhat challenging in an ssh session when working remotely.

3) Write the script in a text editor. Execute the line under the cursor in a REPL in a separate window.
This workflow allows the flexibility of experimental work. Certain IDEs, such as RStudio, already have this functionality. For a uniform experience across various languages, however, a simple shell-based setup can be achieved with the vim+tmux+slime combination, described next.

vim + tmux + slime

The basic idea is to split a shell into two panes divided by a horizontal line, where vim is in the top pane and a REPL session at the bottom. A keyboard shortcut then sends the current line under the cursor in vim to the REPL. (Selecting a block of lines to be evaluated is also an option.)

1) tmux
tmux has many features. Here we just focus on our particular use case. Once installed, create a new session and split the current window into two panes:

tmux new -s session-name
tmux split-window

2) vim-slime
Originally an Emacs idea (SLIME), vim-slime is the tool that sends a block of lines from vim directly to another tmux pane. Once installed, adding these lines to .vimrc enables this powerful trick:

let mapleader=" "
map <Leader>t :SlimuxREPLSendLine<CR>
vmap <Leader>t :SlimuxREPLSendSelection<CR>

In this example, in vim’s normal mode, the keystroke “space” followed by “t” sends either the current line or the highlighted block of lines, if it exists, to another pane. (For the first usage, there will be a prompt that asks which pane should receive the signal from vim.)

Lastly, the vim-tmux-navigator plugin allows seamless navigation among vim split windows and tmux panes.

Here is an animated screenshot that sends three lines in vim to be evaluated in R:


SVD and low-rank approximation

The singular value decomposition (SVD) factorizes a matrix A=U\Sigma V^* where U,V are unitary and \Sigma is diagonal with nonnegative entries. Unlike the eigenvalue decomposition, every (real or complex) matrix is guaranteed to have an SVD. Geometrically, the SVD implies that every linear transformation is equivalent to a rotation, a scaling, then another rotation.

In data analysis, the SVD can be computed to perform principal component analysis (PCA), where an ellipsoid is used to “fit” a data matrix. In other words, once the SVD is computed, a partial SVD, A_k=\sum_{i=1}^{k} \sigma_i u_i v_i^*, can be used to approximate the matrix, where u_i and v_i are the left and right singular vectors, and \sigma_i are the singular values in decreasing order. Equivalently, a partial SVD is obtained by replacing the smallest singular values with zero in the factorization. Intuitively, with the length of each axis of the ellipsoid equal to its corresponding singular value, keeping only the axes of non-trivial length makes the ellipsoid a good fit to the data. Formally, the Eckart-Young-Mirsky theorem states that a partial SVD provides the best approximation to A among all low-rank matrices.

Let A and B be any m\times n matrices, with B having rank at most k. Then

    \[ \| A-A_k \|_2 \leq \| A-B \|_2 \quad\text{and}\quad \| A-A_k \|_F \leq \| A-B \|_F. \]
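The theorem is easy to check numerically. The snippet below (our own illustration, assuming numpy is available) truncates the SVD of a random matrix to rank k and compares it against another rank-k matrix in both norms:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
k = 2

# Partial SVD: keep only the top k singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A random rank-k competitor should do no better, in either norm.
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
assert np.linalg.norm(A - A_k, 2) <= np.linalg.norm(A - B, 2)
assert np.linalg.norm(A - A_k, 'fro') <= np.linalg.norm(A - B, 'fro')

# The spectral error of the truncation is exactly sigma_{k+1}(A).
assert abs(np.linalg.norm(A - A_k, 2) - s[k]) < 1e-10
```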

The theorem has a short proof for either norm, but here we describe a slightly longer proof using spectral methods. Since we will work with singular values of different matrices, we write \sigma_i(\cdot) for the i-th largest singular value of a matrix. We first state the following lemma, which bounds how much a low-rank perturbation of a matrix can affect its singular values.

Let A and B be any m\times n matrices, with B having rank at most k. Then for every i \geq 1, \sigma_{i+k}(A) \leq \sigma_i(A-B).

The theorem is then an easy application of the above lemma.

(Of theorem) Recall that the \ell_2 norm of a matrix is equal to its largest singular value. So

    \[ \| A - A_k \|_2 = \sigma_1(A-A_k) = \sigma_{k+1}(A),\]

which by the lemma (setting i=1) is at most \sigma_1(A-B)=\| A-B \|_2.

For the Frobenius norm, recall that the square of the Frobenius norm of a matrix equals the sum of the squares of its singular values. Applying the lemma above, we have

    \begin{eqnarray*} \| A-A_k \|_F^2 & = & \sum_{i=1}^{n} \sigma_i(A-A_k)^2 \\ & = & \sum_{i=1}^{n-k} \sigma_{i+k}(A)^2 \\ & \leq & \sum_{i=1}^{n-k} \sigma_i(A-B)^2 \\ & \leq & \| A - B \|_F^2, \end{eqnarray*}

finishing the proof.

The lemma is almost a re-statement of one of Weyl’s inequalities:

    \[ \sigma_{i+j-1}(X+Y) \leq \sigma_i(X) + \sigma_j(Y). \]

To see this, let X=A-B, Y=B, and j=k+1, and note that \sigma_{k+1}(B) is zero. Instead of invoking this inequality directly, we will prove the lemma from first principle.

(Of lemma) We first prove the case i=1, i.e., \sigma_{k+1}(A) \leq \sigma_1(A-B).
Let V be the vector space spanned by the top k+1 right singular vectors v_1,\ldots,v_{k+1} of A, and let W be the null space of B. Since \dim V + \dim W \geq (k+1) + (n-k) > n, there must exist a nonzero vector w in V \cap W. Since Bw = 0, bounding from above we have

    \[ \|Aw\|_2 = \|(A-B)w\|_2 \leq \sigma_1(A-B) \|w\|_2. \]

Bounding from below, and using respectively the facts that w is orthogonal to v_i for i > k+1, the Pythagorean theorem, the decreasing order of the singular values, and that w lies in the span of the orthonormal set \{v_1,\ldots,v_{k+1}\}, we obtain

    \begin{eqnarray*} \left\|Aw\right\|_2^2 & = & \left\|\sum_{i=1}^{k+1} \sigma_i u_i \langle v_i, w\rangle\right\|_2^2 \\ & = & \sum_{i=1}^{k+1} \sigma_i^2 \langle v_i, w \rangle ^2 \\ & \geq & \sigma_{k+1}^2 \sum_{i=1}^{k+1} \langle v_i, w \rangle ^2 \\ & = & \sigma_{k+1}^2 \|w\|_2^2. \end{eqnarray*}

Combining both sides yields \sigma_{k+1}(A) \leq \sigma_1(A-B).

Now we prove the general case. Using the i=1 case, we know that for any matrix C with rank at most i+k-1,

    \[ \sigma_{i+k}(A) \leq \sigma_1(A-C). \]

Writing C=C'+C'' (with the decomposition to be chosen below), we have A-C = (A - B - C') + (B - C''). Since \sigma_1(X+Y) \leq \sigma_1(X) + \sigma_1(Y) (express \sigma_1(X+Y) as the inner product \langle v, (X+Y)u \rangle, where u and v are the top right and left singular vectors of X+Y, and apply Cauchy-Schwarz to each summand), we have that

    \[ \sigma_{i+k}(A) \leq \sigma_1(A-B-C') + \sigma_1(B-C''). \]

Since the above inequality holds for arbitrary C with rank at most i+k-1, we can choose C'=(A-B)_{i-1} and C''=B_k. Then we can conclude that

    \[ \sigma_{i+k}(A) \leq \sigma_{i}(A-B) + \sigma_{k+1}(B), \]

completing the proof since \sigma_{k+1}(B) is zero.