How to Make Machines Understand Our Language (II): The Changing Image of Artificial Intelligence

2015-10-14

Author's note: This is the second article in the series How to Make Machines Understand Our Language. It is hard to discuss natural language processing and machine understanding without touching on artificial intelligence, hence this article. Given the author's limited learning, omissions and errors are inevitable; corrections are gratefully welcomed.

# A Craftsman's Toy
Perhaps out of sheer loneliness, humans began imagining artificial intelligence very early. Around 900 BC, during China's Western Zhou dynasty, a master craftsman is said to have built a marvelous mechanical dancer. First, it looked just like a real person:

King Mu of Zhou was touring the west... On the road, a craftsman named Yan Shi was presented to him... The king received him and asked: 'Who is that who has come with you?' He replied: 'A performer that I, your servant, have made.' King Mu looked at it in amazement: it strode, bowed, and raised its head exactly like a real person.

Second, it could sing and dance:

"When the craftsman touched its chin, it sang in perfect tune; when he clasped its hands, it danced in perfect rhythm. Its variations were endless, following whatever one wished." King Mu thereupon "took it for a real person, and watched it together with his consort Sheng Ji and the other ladies of the palace."

Third, it even flirted with the king's women:

As the performance drew to a close, the performer winked and beckoned to the king's attendant concubines.

MORE...

How to Make Machines Understand Our Language (I): Language and Computational Linguistics

2015-09-04

# When You Talk, What Do You Talk About?

Language may be one of the earliest skills humans acquire, and in most cases it stays with us for life. A survey by the UK's Daily Mail found that people speak nearly ten thousand words a day on average (women somewhat more). Assuming a short sentence averages 10 words, you may be speaking a thousand sentences a day without noticing. But do you really know what you are saying?


MORE...

Understanding word2vec

2015-01-28

Introduction

word2vec is an open-source toolkit released by Google for learning word representations.
In natural language processing, we treat a sentence as a collection of words; the set of all words appearing in a document is called the vocabulary.
There are two ways to represent a specific word in the vocabulary:

By its index, e.g. "the 138th word"

or

By a word vector: if there are $k$ words in total and this word is the 138th, its representation is a $k$-dimensional vector with a 1 in position 138 and 0 everywhere else

This representation is called one-hot encoding.
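The two representations above can be sketched in a few lines (a minimal illustration; the toy vocabulary and the chosen word are arbitrary examples):

```python
# Index representation vs. one-hot encoding over a toy vocabulary.
vocabulary = ["cat", "dog", "fish", "bird", "horse"]
k = len(vocabulary)

word = "fish"
index = vocabulary.index(word)   # index representation: position in the vocabulary

one_hot = [0] * k                # k-dimensional vector...
one_hot[index] = 1               # ...with a 1 only at the word's position

print(index)    # 2
print(one_hot)  # [0, 0, 1, 0, 0]
```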

But apart from letting us index the word, this representation carries no other information. We would like a better representation, one that captures, at least to some extent, the meaning of the word.

This is where word2vec comes in.

Let us first consider: how can the meaning of a word be expressed (mathematically)?

MORE...

LDA and PLSA

2014-07-03

Topic models are an important technique in text mining: they uncover latent topics and thereby enable topic-based annotation of text.
Once such topic annotations are available, downstream tasks like text organization, summarization, and search become much simpler.

LDA is a widely used topic model: its physical interpretation is direct, its mathematical form is elegant, and it is built in the Bayesian framework.
Before discussing LDA, however, we first need to understand Probabilistic Latent Semantic Analysis (PLSA), the foundation of topic models.

PLSA

In the worldview of PLSA, a document has a two-layer structure: the first layer is topics (a document contains several topics), and the second layer is words (each topic contains several words).
Generating a word $w$ from a document $d$ means first generating a topic $z$ from the document, then generating the word $w$ from that topic $z$.
Suppose the document contains $N$ words, there are $K$ topics, and words are generated independently; in probabilistic terms the generative model is:

$$ p(w|d) = \prod_{n=1}^N \sum_{k=1}^K p(w_n|z_k) p(z_k|d) $$

If we further assume documents are independent, the generative probability of the whole corpus follows easily. PLSA is a relatively simple model and can be fitted with the EM algorithm.
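The formula above can be evaluated directly once $p(w|z)$ and $p(z|d)$ are given. A toy sketch (all probability tables below are made-up illustrative numbers, not fitted parameters):

```python
# Toy PLSA evaluation: p(w|d) = prod_n sum_k p(w_n|z_k) p(z_k|d).

# p(z|d): topic mixture of one document, K = 2 topics (illustrative numbers)
p_z_given_d = [0.6, 0.4]

# p(w|z): word distribution of each topic over a 3-word vocabulary
p_w_given_z = {
    0: {"data": 0.5, "model": 0.3, "sport": 0.2},
    1: {"data": 0.1, "model": 0.2, "sport": 0.7},
}

def p_word(w):
    """p(w|d) for a single word: marginalize over topics."""
    return sum(p_z_given_d[k] * p_w_given_z[k][w]
               for k in range(len(p_z_given_d)))

def p_doc(words):
    """p(w|d) for a whole document: product over independent words."""
    prob = 1.0
    for w in words:
        prob *= p_word(w)
    return prob

print(p_word("data"))            # 0.6*0.5 + 0.4*0.1 = 0.34
print(p_doc(["data", "sport"]))  # 0.34 * 0.40 = 0.136
```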

MORE...

Test for independence

2014-03-27

Abstract

The ARMA process is a widely used time-series model with many good properties. We will show in this article that, by testing the residuals of a sequence, we can test whether the ARMA model is applicable.
The most important part is to build a test statistic for white noise with the Ljung-Box method.

Background

Consider a real-valued time series $(Z_{k})_{k\in \mathbb{Z}}$. We want to build a statistical test of the hypothesis
$$ H_{0}=\{(Z_{k})_{k\in \mathbb{Z}} \mbox{ is a white noise}\}$$
against
$$ H_{1}=\{(Z_{k})_{k\in \mathbb{Z}} \mbox{ is not a white noise}\}$$

Let $\hat{\mu}_n$ be the empirical mean, $\hat{\gamma}_n$ be the empirical autocovariance function, $\hat{\rho}_n$ be the empirical autocorrelation function:

$$\hat{\gamma}_n(t)=n^{-1}\sum_{1\leq s,\,s+t\leq n}(Z_s-\hat{\mu}_n)(Z_{s+t}-\hat{\mu}_n)$$
$$\hat{\rho}_n(t)=\frac{\hat{\gamma}_n(t)}{\hat{\gamma}_n(0)}$$

The Ljung-Box statistical test at lag $h>1$ is then defined as

$$ T_n(h)=n(n+2)\sum_{t=1}^{h}\frac{(\hat{\rho}_n(t))^2}{n-t} $$

We now do this step by step.
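In code, the statistic translates almost literally (a bare sketch for illustration; in practice statsmodels' `acorr_ljungbox` provides a tested implementation):

```python
import random

def ljung_box(z, h):
    """Ljung-Box statistic T_n(h) = n(n+2) * sum_{t=1..h} rho_n(t)^2 / (n-t)."""
    n = len(z)
    mu = sum(z) / n
    def gamma(t):
        # empirical autocovariance at lag t
        return sum((z[s] - mu) * (z[s + t] - mu) for s in range(n - t)) / n
    g0 = gamma(0)
    return n * (n + 2) * sum((gamma(t) / g0) ** 2 / (n - t)
                             for t in range(1, h + 1))

# Under H0 (white noise), T_n(h) is approximately chi-squared with h
# degrees of freedom, so we reject H0 when T_n(h) is large.
random.seed(0)
white = [random.gauss(0.0, 1.0) for _ in range(500)]
trend = [float(s) for s in range(500)]  # strongly autocorrelated series

print(ljung_box(white, 10))  # compare with the chi2(10) 95% quantile, ~18.3
print(ljung_box(trend, 10))  # far larger: white noise clearly rejected
```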

MORE...

Blind Separation of Instantaneous Mixtures of Nonstationary Sources

2013-10-27

Abstract

Most source separation algorithms are based on a model of stationary sources. However, it is a simple matter to take advantage of possible nonstationarities of the sources to achieve separation. This paper develops novel approaches in this direction based on the principles of maximum likelihood and minimum mutual information. These principles are exploited by efficient algorithms in both the off-line case (via a new joint diagonalization procedure) and in the on-line case (via a Newton-like procedure). Some experiments are presented, showing the good performance of our algorithms and evidencing an interesting feature of our methods: their ability to achieve a kind of super-efficiency. The paper concludes with a discussion contrasting separating methods for non-Gaussian and nonstationary models, emphasizing that, as a matter of fact, "what makes the algorithms work" is, strictly speaking, not the nonstationarity itself but rather the property that each realization of the source signals has a time-varying envelope.

Introduction

Theoretical Basis

In this report, we will mainly investigate two approaches, namely Maximum Likelihood and Block Gaussian Likelihood, to build objective functions for blind separation problems, and later we will discuss their connections with Gaussian Mutual Information. Unless otherwise specified, we assume the sources are nonstationary Gaussian sources.

MORE...

Bayesian methods in ranking

2013-06-25

Many websites use a ranking system to determine the display order of content, especially Q&A or news websites such as Quora, Reddit, and Zhihu: a post or question may have multiple comments or answers, and users can gradually change the display order with the upvote and downvote buttons.
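One simple example of the Bayesian flavor (a sketch, not necessarily what any of those sites actually use): place a Beta prior on each item's upvote probability and rank by the posterior mean, so items with only a handful of votes are pulled toward the prior rather than jumping straight to the top.

```python
def bayesian_score(upvotes, downvotes, prior_up=1.0, prior_down=1.0):
    """Posterior mean of the upvote probability under a Beta(prior_up, prior_down)
    prior. With the default uniform Beta(1, 1) prior this is Laplace smoothing."""
    return (upvotes + prior_up) / (upvotes + downvotes + prior_up + prior_down)

# An item with 1 upvote and 0 downvotes does not outrank
# an item with 90 upvotes and 10 downvotes:
print(bayesian_score(1, 0))    # 2/3  ~= 0.667
print(bayesian_score(90, 10))  # 91/102 ~= 0.892
```

The prior acts as pseudo-counts: the fewer real votes an item has, the more its score is dominated by the prior, which is exactly the shrinkage behavior a naive upvote ratio lacks.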

MORE...

How to choose a good chart

2013-04-11

Visualization is a very important part of data science. Andrew Abela posted a guide on how to choose a good chart to help visualize data.

MORE...

Grow a search result

2012-11-27

“Grow A Search Result” is an organic kind of search that presents results that grow over time, drawing attention to the things about which you are most passionate.

When we think of the experiences that search engines are designed to support, criteria such as speed and efficiency instantly come to mind. However, one of our main interests is in how web use is intertwined with daily life, and in understanding the activities in which search engines play a role. In addition to fast, relevant search, our search engine focuses on:

MORE...