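Understanding word2vec
word2vec is an open-source toolkit from Google for learning word representations.
In natural language processing, we treat a sentence as a collection of words; the set of all words that appear in a document is then called the vocabulary.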
A word can be represented by an index, for example
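A word can also be represented by a word vector: suppose there are k words in total and this word is the 138th; its representation is then a k-dimensional vector with a 1 in position 138 and 0 everywhere else.
This representation is called one-hot encoding.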
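As a minimal sketch of the one-hot encoding described above, the snippet below builds the vector for a word from its index. The five-word vocabulary and the helper name `one_hot` are made up for illustration, and indices are 0-based here, so the "138th word" corresponds to index 137.

```python
def one_hot(index, vocab_size):
    """Return a list of length vocab_size with 1 at `index` (0-based) and 0 elsewhere."""
    if not 0 <= index < vocab_size:
        raise ValueError("index must lie in [0, vocab_size)")
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


# Hypothetical toy vocabulary; a real one would be built from a corpus.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

vec = one_hot(word_to_index["sat"], len(vocab))
print(vec)
```

Note that the vector length equals the vocabulary size, so every word's vector is almost entirely zeros.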
This is what led to word2vec.
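LDA is a widely used topic model. It has a direct physical interpretation and an elegant mathematical form, and it is built on a Bayesian framework.
Before discussing LDA, however, we first need to understand Probabilistic Latent Semantic Analysis (PLSA), which is the foundation of topic models.
In PLSA's view of the world, a document has a two-layer structure: the first layer is topics, with each document containing several topics; the second layer is words, with each topic containing several words.
Generating a word $w$ from a document $d$ therefore means first generating a topic $z$ from the document and then generating the word $w$ from the topic $z$.
Suppose the document has $N$ words and $K$ topics in total, and the words are independent. The generative model can then be described in terms of probabilities as: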
$$ p(w|d) = \prod_{n=1}^N \sum_{k=1}^K p(w_n|z_k) p(z_k|d) $$
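If we also assume that documents are independent, the generation probability of the whole corpus is easy to write down. The PLSA model is fairly simple and can be fitted with the EM algorithm.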
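To make the formula for $p(w|d)$ concrete, here is a small sketch that evaluates its logarithm for a single document. The two topics, the three-word vocabulary and all probability values are invented for illustration; in practice $p(w|z)$ and $p(z|d)$ would be estimated with EM rather than fixed by hand.

```python
import math


def plsa_log_likelihood(doc_words, p_w_given_z, p_z_given_d):
    """Compute log p(w|d) = sum_n log( sum_k p(w_n|z_k) * p(z_k|d) )."""
    log_lik = 0.0
    for word in doc_words:
        # Marginalise over topics for this word.
        mixture = sum(
            p_w_given_z[k][word] * p_z_given_d[k]
            for k in range(len(p_z_given_d))
        )
        log_lik += math.log(mixture)
    return log_lik


# Hypothetical parameters: K = 2 topics over a 3-word vocabulary.
p_w_given_z = [
    {"goal": 0.7, "vote": 0.1, "team": 0.2},  # topic 0
    {"goal": 0.1, "vote": 0.8, "team": 0.1},  # topic 1
]
p_z_given_d = [0.5, 0.5]  # topic mixture of one document

doc = ["goal", "team", "vote"]
ll = plsa_log_likelihood(doc, p_w_given_z, p_z_given_d)
print(ll, math.exp(ll))
```

Working in log space turns the product over the $N$ words into a sum, which avoids numerical underflow for long documents.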
Writing Markdown formulas with MathJax in Sublime Text
2014-06-29
Test for independence
The ARMA process is a widely used time series model with many good properties. In this article we show that, by testing the residuals of a sequence, we can check whether an ARMA model can be applied.
The most important part is to build a white noise test statistic with the Ljung-Box method.
Consider a real-valued time series $(Z_{k})_{k\in \mathbb{Z}}$. We want to build a statistical test of the hypotheses
$$ H_{0}=\{(Z_{k})_{k\in \mathbb{Z}} \mbox{ is a white noise}\}$$
$$ H_{1}=\{(Z_{k})_{k\in \mathbb{Z}} \mbox{ is not a white noise}\}$$
Let $\hat{\mu}_n$ be the empirical mean, $\hat{\gamma}_n$ be the empirical autocovariance function, and $\hat{\rho}_n$ be the empirical autocorrelation function:
$$\hat{\mu}_n=n^{-1}\sum_{s=1}^{n}Z_s$$
$$\hat{\gamma}_n(t)=n^{-1}\sum_{1\leq s,s+t\leq n}(Z_s-\hat{\mu}_n)(Z_{s+t}-\hat{\mu}_n)$$
$$\hat{\rho}_n(t)=\frac{\hat{\gamma}_n(t)}{\hat{\gamma}_n(0)}$$
The Ljung-Box test statistic at lag $h\geq 1$ is then defined as
$$ T_n(h)=n(n+2)\sum_{t=1}^{h}\frac{(\hat{\rho}_n(t))^2}{n-t} $$
We now do this step by step.
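A minimal sketch of the computation, following the definitions of $\hat{\mu}_n$, $\hat{\gamma}_n$, $\hat{\rho}_n$ and $T_n(h)$ above, might look like this. The function name `ljung_box` and the toy alternating series are assumptions for illustration, not part of the original article.

```python
def ljung_box(z, h):
    """Ljung-Box statistic T_n(h) = n(n+2) * sum_{t=1}^{h} rho_n(t)^2 / (n - t)."""
    n = len(z)
    if not 1 <= h < n:
        raise ValueError("lag h must satisfy 1 <= h < len(z)")

    mu = sum(z) / n  # empirical mean

    def gamma(t):
        # Empirical autocovariance at lag t (normalised by n, as in the text).
        return sum((z[s] - mu) * (z[s + t] - mu) for s in range(n - t)) / n

    gamma0 = gamma(0)
    if gamma0 == 0:
        raise ValueError("series is constant; autocorrelation is undefined")

    total = 0.0
    for t in range(1, h + 1):
        rho = gamma(t) / gamma0  # empirical autocorrelation at lag t
        total += rho * rho / (n - t)
    return n * (n + 2) * total


# Toy example: a strongly alternating series, clearly not white noise.
z = [1, -1, 1, -1, 1, -1, 1, -1]
print(ljung_box(z, 1))  # 8.75
```

Under $H_0$ the statistic is asymptotically $\chi^2$ distributed with $h$ degrees of freedom, so the value 8.75 would be compared with a $\chi^2_1$ quantile (about 3.84 at the 5% level), which rejects white noise for this toy series.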
Blind Separation of Instantaneous Mixtures of Nonstationary Sources
Most source separation algorithms are based on a model of stationary sources. However, it is a simple matter to take advantage of possible nonstationarities of the sources to achieve separation. This paper develops novel approaches in this direction based on the principles of maximum likelihood and minimum mutual information. These principles are exploited by efficient algorithms in both the off-line case (via a new joint diagonalization procedure) and in the on-line case (via a Newton-like procedure). Some experiments showing the good performance of our algorithms and evidencing an interesting feature of our methods are presented: their ability to achieve a kind of super-efficiency. The paper concludes with a discussion contrasting separating methods for non-Gaussian and nonstationary models and emphasizing that, as a matter of fact, "what makes the algorithms work" is, strictly speaking, not the nonstationarity itself but rather the property that each realization of the source signals has a time-varying envelope.
Theoretical Basis
In this report, we mainly investigate two approaches, namely Maximum Likelihood and Block Gaussian Likelihood, for building objective functions for blind separation problems. Later we discuss their connections with Gaussian Mutual Information. Unless otherwise specified, we assume that the sources below are non-stationary Gaussian sources.
Bayesian methods in ranking
2013-06-25 Many websites use a ranking system to determine the display order of content, especially Q&A or news websites such as Quora, Reddit, and zhihu. A post or question may have multiple comments or answers, and users can press the upvote or downvote button to gradually change the display order.
How to choose a good chart
2013-04-11 Visualization is a very important part of data science. Andrew Abela posted a guide on how to choose a good chart to help visualize data.
Grow a search result
2012-11-27 "Grow A Search Result" is an organic kind of search that presents results that grow over time, drawing attention to the things about which you are most passionate.
When we think of the experiences that search engines are designed to support, criteria such as speed and efficiency instantly come to mind. However, one of our main interests is in how web use is intertwined with daily life, and in understanding the activities in which search engines play a role. In addition to fast, relevant search, our search engine focuses on: