Leveraged volume sampling for linear regression
Active learning in linear regression with multiplicative error bounds
The authors want to find a linear regression whose (square) loss is at most \(1+\epsilon\) times the loss of the best regression, while querying as few labels as possible. They show that this can be achieved with \(O\left(d \log d + \frac{d}{\epsilon}\right)\) labels with high probability.
Notation
The input matrix \(X \in \mathbb{R}^{n \times d}\) has full rank \(d\), and \(y \in \mathbb{R}^n\) is the target vector. The goal of the learner is to find a weight vector \(w \in \mathbb{R}^d\) that minimizes the square loss: \(w^* = \arg\min_{w \in \mathbb{R}^d} \mathcal{L}(w),\) where \(\mathcal{L}(w) = \sum_{i=1}^n (x_i^\top w - y_i)^2 = \|X w - y\|_2^2.\)
Given both the matrix \(X\) and the full vector \(y\), the least squares solution can be computed directly as \(w^* = X^+ y\), where \(X^+\) is the pseudo-inverse of \(X\).
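As a concrete reference point, here is a minimal NumPy sketch of the full-information least squares solution via the pseudo-inverse (the data generation and variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))                                  # input matrix, assumed full rank d
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)    # noisy targets

w_star = np.linalg.pinv(X) @ y               # w* = X^+ y
loss = np.sum((X @ w_star - y) ** 2)         # L(w*) = ||X w* - y||_2^2
```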
Standard volume sampling
Given \(X \in \mathbb{R}^{n \times d}\) and a size \(k \geq d\), standard volume sampling jointly chooses a set \(S\) of \(k\) indices in \([n]\) with probability
\[\Pr(S) = \frac{\det(X_S^\top X_S)}{\binom{n-d}{k-d} \det(X^\top X)}\]
After querying the target values \(y_S\) of the rows in \(S\), we set \(w^*_{S} = (X_S)^+ y_S\).
The key property of this approach is that the estimator is unbiased: \(\mathbb{E}[w^*_S] = w^*\).
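For intuition, a brute-force sketch of standard volume sampling is given below (feasible only for tiny \(n\), since it enumerates all \(\binom{n}{k}\) subsets); the unbiasedness \(\mathbb{E}[w^*_S] = w^*\) can then be checked empirically by averaging over many draws:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 10, 3, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Pr(S) is proportional to det(X_S^T X_S) over all subsets S of size k
# (the binomial factor and det(X^T X) are absorbed by normalization).
subsets = list(combinations(range(n), k))
weights = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subsets])
probs = weights / weights.sum()

# Draw one subset, query only its labels, and fit on them.
S = list(subsets[rng.choice(len(subsets), p=probs)])
w_S = np.linalg.pinv(X[S]) @ y[S]            # w*_S = (X_S)^+ y_S
```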
However, they show that in some cases this approach cannot guarantee a \(1+\epsilon\) multiplicative error bound without querying the labels of at least half of the data points.
Rescaled volume sampling
Define the leverage scores \(l_i = x_i^{\top}(X^{\top}X)^{-1}x_i\) and the distribution \(q_i = \frac{l_i}{d}\) (since \(\sum_{i=1}^n l_i = d\), the \(q_i\) sum to 1).
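In code, the leverage scores and the distribution \(q\) can be computed directly (a small NumPy sketch with illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))

# Leverage scores l_i = x_i^T (X^T X)^{-1} x_i, and q_i = l_i / d.
G_inv = np.linalg.inv(X.T @ X)
l = np.einsum('ij,jk,ik->i', X, G_inv, X)    # all l_i at once
q = l / d                                    # sums to 1, since sum_i l_i = d
```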
Sample \(k\) row indices \(\pi_1, \ldots, \pi_k\) with replacement from \(\{1, \ldots, n\}\) according to
\[\Pr(\pi) \propto \det \left( \sum_{i=1}^k \frac{1}{q_{\pi_i}} x_{\pi_i} x_{\pi_i}^\top \right) \prod_{i=1}^k q_{\pi_i}.\]
Define \(Q_{\pi} = \sum_{i=1}^k \frac{1}{q_{\pi_i}} \mathbf{e}_{\pi_i} \mathbf{e}_{\pi_i}^\top \in \mathbb{R}^{n \times n}\).
The estimator is obtained by solving the reweighted least squares problem
\[w_*^{\pi} = \arg \min_{w \in \mathbb{R}^d} \left\|Q_{\pi}^{\frac{1}{2}} (X w - y) \right\|_2^2.\]
If \(k = O\left(d \ln \frac{d}{\delta} + \frac{d}{\epsilon \delta}\right)\), then \(\mathcal{L}(w_*^{\pi}) \leq (1+\epsilon)\,\mathcal{L}(w^*)\) with probability at least \(1 - \delta\), i.e. the produced model has loss at most \(1+\epsilon\) times that of the best regression.
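Below is a brute-force sketch of the rescaled sampler and the reweighted estimate \(w_*^{\pi}\). It is only workable for tiny \(n\) and \(k\), since it enumerates all \(n^k\) tuples to sample exactly from the determinantal distribution above; the variable names and data are illustrative:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, d, k = 8, 2, 4
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Leverage-based distribution q_i = l_i / d.
G_inv = np.linalg.inv(X.T @ X)
q = np.einsum('ij,jk,ik->i', X, G_inv, X) / d

# Pr(pi) proportional to det( sum_i x_{pi_i} x_{pi_i}^T / q_{pi_i} ) * prod_i q_{pi_i}.
tuples = list(product(range(n), repeat=k))
weights = np.array([
    np.linalg.det(sum(np.outer(X[i], X[i]) / q[i] for i in pi)) * np.prod(q[list(pi)])
    for pi in tuples
])
probs = weights / weights.sum()
pi = list(tuples[rng.choice(len(tuples), p=probs)])

# Solve the reweighted problem min_w || Q_pi^{1/2} (X w - y) ||^2:
# each sampled row (with multiplicity) enters with weight 1 / q_{pi_i}.
sqrt_w = 1.0 / np.sqrt(q[pi])
w_pi = np.linalg.lstsq(sqrt_w[:, None] * X[pi], sqrt_w * y[pi], rcond=None)[0]
```

Note that only the labels \(y_{\pi_1}, \ldots, y_{\pi_k}\) of the sampled rows are used in the final solve, which is what makes the procedure label-efficient.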