Blue sky, a small lake, a little cottage on the water

MMDS Notes: W2 - Locality-Sensitive Hashing

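Locality-Sensitive Hashing (LSH). The idea is to hash the points of the original data space so that nearby points land in the same bucket, or in neighboring buckets, with high probability, while points that are far apart rarely end up together. This shrinks the set of candidates that later processing has to examine, which simplifies the downstream work.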
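The bucketing idea can be sketched with MinHash signatures split into bands, where two sets become a candidate pair if they share a bucket in at least one band. This is only a minimal illustration: the function names, the use of FNV-1a seeded with the hash index, and the values of `k` and `rows` are all assumptions made for the example, not something taken from the notes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// minhashSignature returns a k-value MinHash signature for a set of strings.
// The i-th "hash function" is FNV-1a with the index i mixed into the input
// (an illustrative choice, not a prescribed construction).
func minhashSignature(set []string, k int) []uint64 {
	sig := make([]uint64, k)
	for i := range sig {
		sig[i] = ^uint64(0) // start at the maximum so any hash value is smaller
		for _, item := range set {
			h := fnv.New64a()
			fmt.Fprintf(h, "%d:%s", i, item)
			if v := h.Sum64(); v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

// bandKeys splits a signature into bands of `rows` values each and returns
// one bucket key per band. The band index is part of the key so that
// different bands never share a bucket.
func bandKeys(sig []uint64, rows int) []string {
	var keys []string
	for start := 0; start+rows <= len(sig); start += rows {
		keys = append(keys, fmt.Sprintf("%d|%v", start/rows, sig[start:start+rows]))
	}
	return keys
}

// isCandidatePair reports whether two sets fall into the same bucket
// in at least one band.
func isCandidatePair(a, b []string, k, rows int) bool {
	ka := bandKeys(minhashSignature(a, k), rows)
	kb := bandKeys(minhashSignature(b, k), rows)
	for i := range ka {
		if ka[i] == kb[i] {
			return true
		}
	}
	return false
}

func main() {
	a := []string{"the", "quick", "brown", "fox"}
	b := []string{"the", "quick", "brown", "dog"}
	c := []string{"completely", "different", "words", "here"}
	fmt.Println("a vs b:", isCandidatePair(a, b, 20, 2))
	fmt.Println("a vs c:", isCandidatePair(a, c, 20, 2))
}
```

Only pairs that share a bucket need a full similarity comparison afterwards. Fewer rows per band makes similar sets more likely to collide, at the cost of more false candidates.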

Set timezone in Python

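While writing a script today, I noticed that datetime.datetime.now() printed UTC time, while the same call in ipython printed local time. It took a long search to find a solution that does not need pytz: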

MMDS Notes: W1 - HDFS & MR

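A while ago I squeezed in the time to finish the MMDS course on Coursera and was left with a pile of notes. I am tidying them up here, which also serves to christen the new blog.

The course runs 7 weeks in total; this post covers the HDFS & MR part of the first week.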

Control the number of goroutines via a buffered channel

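I have still been writing crawlers lately, and found that goroutines are fast but make it very easy to run too many requests at once and get throttled by the server. Letting each goroutine sleep briefly before it starts solves some of the problems, but that approach never felt reliable. Reading further through the docs, I found that a buffered channel fits well here.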

Golang and JSON API

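I have recently been trying to write crawler-style programs in golang, which inevitably means dealing with JSON APIs. I ran into a few problems along the way and am noting them here for future reference.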

goroutines + channel

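The channel is one of the more interesting things in golang. You can think of it as a semaphore (the unbuffered version) or as a FIFO queue (the buffered version). This post just summarizes a few things I have used recently, kept as a record for myself.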

Ajax loading multi series to jqPlot

The Ajax example for jqPlot only shows how to plot a single series, but the request I had needed multiple series. The solution is easy; I am recording it here for later reference. In the example, the function used to load the Ajax data is ajaxDataRenderer, which returns an array of data. For multiple series, just return more than one data array. Here is a sample data set: [ [ [1,1],[2,2],[3,3],[4,4],[5,5] ], [ [5,1],[4,2],[3,3],[2,4],[1,5] ] ]

Some failed attempts on PNaCl

Google released its [PNaCl](http://www.chromium.org/nativeclient/pnacl/building-and-testing-portable-native-client) project at Google I/O 2013. It allows users to write portable native client applications, which can be translated into native client programs and executed on any supported architecture.

The official toolchain contains a clang frontend, which can only compile C/C++ code into a PNaCl application. But a PNaCl application is a subset of LLVM bitcode, so I have tried some other languages that have an LLVM frontend. I am still working on a solution, and this article just records some failed attempts.