Friday, December 30, 2011

Daily Bookmarks 20111230

Can Your Programming Language Do This? - Joel on Software
http://www.joelonsoftware.com/items/2006/08/01.html
Procs: Run Python Functions in Parallel Processes (Archived domnit.org blog)
http://domnit.org/blog/2007/10/procs.html
Some Notes on Tim Bray's Wide Finder Benchmark
http://effbot.org/zone/wide-finder.htm
http://domnit.org/misc/wf-proc.py
Some thoughts on concurrency « Isotoma Blog
http://blog.isotoma.com/2008/05/some-thoughts-on-concurrency/
Wide Finder
http://dalkescientific.com/writings/diary/archive/2007/10/07/wide_finder.html#wf_effbot
other Wide Finder implementations
http://www.dalkescientific.com/writings/diary/archive/2007/10/10/other_wide_finder_implementations.html
Bill de hÓra: wfinder_serial.py
http://www.dehora.net/journal/2007/10/wfinder_serialpy.html


http://effbot.org wide finder - Google Search
https://www.google.com/search?q=http://effbot.org+wide+finder&hl=zh-TW&client=firefox-a&hs=qpz&rls=org.mozilla:zh-TW:official&prmd=imvns&ei=EZ78TpTiF8famAXoxpyeAg&start=10&sa=N&biw=1132&bih=595

Chinese text comparison - 动态感觉 静观其变 - Ycool Blog
http://xlp223.ycool.com/post.1465895.html
Coding Horror: Exploring Wide Finder
http://www.codinghorror.com/blog/2008/06/exploring-wide-finder.html
girtby.net – Wide Finder 2: The Widening
http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/

Programming Pearls study notes (1) | 梦想家的Blog
http://www.hoopercao.com/2011/01/25/%E7%BC%96%E7%A8%8B%E7%8F%A0%E7%8E%91-programming-pearls-%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0%EF%BC%88%E4%B8%80%EF%BC%89/
Baidu interview question: how to find the "sibling words" (anagrams) in a dictionary_IT interview questions_Baidu Space hash map
http://hi.baidu.com/mianshiti/blog/item/33590e3786e89c305bb5f592.html
Programming Pearls study notes (2) | 梦想家的Blog
http://www.hoopercao.com/2011/01/26/%e7%bc%96%e7%a8%8b%e7%8f%a0%e7%8e%91-programming-pearls-%e5%ad%a6%e4%b9%a0%e7%ac%94%e8%ae%b0%ef%bc%88%e4%ba%8c%ef%bc%89/
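Baidu interview question: forward maximum matching word segmentation, what is the fastest way to do it?_IT interview questions_Baidu Space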
http://hi.baidu.com/mianshiti/blog/item/957d2af0bd50afc00b46e079.html
The beautiful hash_liangrt_fd's space_Baidu Space
http://hi.baidu.com/liangrt_fd/blog/item/3f034742d28123046a63e51c.html
Xunlei interview question: merging users' basic information with their movie-viewing records_IT interview questions_Baidu Space
http://hi.baidu.com/mianshiti/blog/item/546acdc74ab9bea28326ace4.html
Baidu interview question: determining the type of a URL_IT interview questions_Baidu Space
http://hi.baidu.com/mianshiti/blog/item/bef194b4a34af6ff30add10f.html
Python | Learn Python step by step - Part 3
http://www.91python.com/archives/tag/python/page/3
Multiple processes: speeding up processing (text data)_foricee's space_Baidu Space
http://hi.baidu.com/foricee/blog/item/96f49f2679a4cf174d088d7f.html


marxy's musing on technology: python multiprocessing pays off
http://blog.marxy.org/2010/07/python-multiprocessing-pays-off.html
awk: computing the sum of a column of numbers (车东[Blog^2])
http://www.chedong.com/blog/archives/000682.html
911: what were you doing on this day four years ago? (车东[Blog^2]) gais
http://www.chedong.com/blog/archives/000987.html
How exactly should "indexes" be understood? - Beginner topics - Java - ITeye Forum
http://www.iteye.com/topic/1038366
The "degeneracy" algorithm: implementing an automatic text clustering algorithm_刀剑笑_Sina Blog
http://blog.sina.com.cn/s/blog_57cae499010009l8.html
Yesterday's focus: more on classifying and clustering machine news, related article recommendations - - ITeye Columns channel
http://www.iteye.com/wiki/blog/940729
How should tables for complex product categories be designed? - Database - Tech - ITeye Forum
http://www.iteye.com/topic/26987
Automatic data categorization - Enterprise applications - Java - ITeye Forum
http://www.iteye.com/topic/1014463
Classification and clustering - - ITeye technical site
http://samuschen.iteye.com/blog/562352
Programming Collective Intelligence, chapter 3: a brief look at document clustering - mdyang - cnblogs
http://www.cnblogs.com/mdyang/archive/2011/07/14/PCI-ch3.html
小默的研究中心 » Data clustering
http://wpxiaomo.sinaapp.com/archives/426
Data-intensive Text Processing with MapReduce reading notes (index), last updated 2011.7.23 - mdyang - cnblogs
http://www.cnblogs.com/mdyang/archive/2011/06/29/data-intensive-text-prcessing-with-mapreduce-contents.html
Using divide and conquer to solve the memory bottleneck of the MapReduce stripes pattern - mdyang - cnblogs
http://www.cnblogs.com/mdyang/archive/2011/07/21/mapreduce-stripe-vocabulary-divide-and-conquer.html
Text clustering research | BT的花
http://www.dup2.org/node/1015













Thursday, December 29, 2011

Daily Bookmarks 20111229

language features - Why doesn't a python dict.update() return the object? - Stack Overflow
http://stackoverflow.com/questions/1452995/why-doesnt-a-python-dict-update-return-the-object
python - Memory efficiency: One large dictionary or a dictionary of smaller dictionaries? - Stack Overflow
http://stackoverflow.com/questions/671403/memory-efficiency-one-large-dictionary-or-a-dictionary-of-smaller-dictionaries
Hash Table Benchmarks
http://incise.org/hash-table-benchmarks.html

Simple, Complex and Complicated: Python's Generator
http://voidpp.blogspot.com/2007/04/pythongenerator.html
Python dictionary implementation | Laurent Luce's Blog
http://www.laurentluce.com/posts/python-dictionary-implementation/
Is a Python dictionary an example of a hash table? - Stack Overflow
http://stackoverflow.com/questions/114830/is-a-python-dictionary-an-example-of-a-hash-table

The Unexpected Ubiquity of Spam Detection Algorithms - Technology Review
http://www.technologyreview.com/blog/mimssbits/26579/

grouping
jQuery Sortable - Limit number of items in list - Stack Overflow
http://stackoverflow.com/questions/2438516/jquery-sortable-limit-number-of-items-in-list
Group list items with jQuery - Stack Overflow
http://stackoverflow.com/questions/6110571/group-list-items-with-jquery
Advance grouping of List Items - Jquery - Stack Overflow
http://stackoverflow.com/questions/1666967/advance-grouping-of-list-items-jquery
jQuery Table Plugin with Group By - Stack Overflow
http://stackoverflow.com/questions/1576908/jquery-table-plugin-with-group-by


IOError: [Errno 32] Broken pipe - Google Search
https://www.google.com/search?q=IOError:+%5BErrno+32%5D+Broken+pipe&hl=zh-TW&client=firefox&hs=6CD&rls=org.mozilla:zh-TW:official&prmd=imvnsfd&source=lnt&tbs=lr:lang_1zh-CN%7Clang_1zh-TW&lr=lang_zh-CN%7Clang_zh-TW&sa=X&ei=kTD8TqTZMfHtmAXy2vinCA&ved=0CAgQpwUoAQ&biw=1280&bih=876
I am LAZY bones ? : subprocesses in Python (subprocess)
http://luy.li/2010/04/14/python_subprocess/
Recommended by Ryutlis: when a Python script's output goes through a pipe on Unix, an IOError: [Errno 32] Broken pipe error appears after output has accumulated for a while
http://www.douban.com/people/ryutlis/rec/93655626/
Python Error/Exception: IOError: [Errno 32] Broken pipe | Jay Taylor
http://jaytaylor.com/blog/2009/11/06/python-errorexception-ioerror-errno-32-broken-pipe/
How to handle a broken pipe (SIGPIPE) in python? - Stack Overflow
http://stackoverflow.com/questions/180095/how-to-handle-a-broken-pipe-sigpipe-in-python
花&&猪 » Blog Archive » subprocess revisited
http://www.liuzhongshu.com/code/subprocess-detail.html



Looking back at Scrapy » Libear | Focused on the latest Internet technologies
http://blog.libears.com/2011-06-11/python/%E5%9B%9E%E9%A1%BEscrapy
铁丑磨成针 » Studying how MySQL's SELECT…ORDER BY works
http://tiechou.info/?p=46
Scrapy, an easy-to-use little crawler – 鼻有鼻有鼻有涕
http://bububut.com/2011/07/%E6%98%93%E7%94%A8%E5%B0%8F%E7%88%AC%E8%99%ABscrapy/











Wednesday, December 28, 2011

Daily Bookmarks 20111228

Aproximative string matching - Python
http://bytes.com/topic/python/answers/390301-aproximative-string-matching
The soundex module
http://effbot.org/librarybook/soundex.htm
Projects FuGrep
http://www.j-raedler.de/projects/
FuGrep 0.50 : Python Package Index
http://pypi.python.org/pypi/FuGrep/0.50
OneZ Studio: The Python Egg format
http://onezstudio.blogspot.com/2006/04/python-egg.html

An Introduction to Python Lists
http://effbot.org/zone/python-list.htm#performance
Python Hash Algorithms
http://effbot.org/zone/python-hash.htm

My program is too slow. How do I speed it up?
http://effbot.org/pyfaq/my-program-is-too-slow-how-do-i-speed-it-up.htm

wiLdGoose » Deploying lightweight Subversion on 64-bit FreeBSD
https://www.xuchao.org/technology/subversion_daemon_for_freebsd.html
Python post: RE: [python-chinese] A question about multi-pattern string matching - 哲思
http://www.zeuux.org/group/python/bbs/content/37932/
The WM algorithm - ChinaUnix Blog - IT people sharing a happy life with you
http://blog.chinaunix.net/space.php?uid=20435679&do=blog&id=1680201
Exact multi-pattern string matching (profanity / sensitive-word search algorithm): the TTMP algorithm, the theory - Sumtec - cnblogs
http://www.cnblogs.com/sumtec/archive/2008/02/01/1061742.html
The Wu-Manber algorithm - 知足常乐 - BlogBus
http://wzgyantai.blogbus.com/logs/46021622.html
Wu-Manber, a classic multi-pattern matching algorithm - 小北的家 - Blog Channel - CSDN.NET
http://blog.csdn.net/iJuliet/article/details/4206487
multi-pattern string matching python - Google Search
https://www.google.com/search?q=%E5%A4%9A%E6%A8%A1%E5%BC%8F%E5%AD%97%E7%AC%A6%E4%B8%B2%E5%8C%B9%E9%85%8D+python&hl=zh-TW&client=firefox&hs=k2b&rls=org.mozilla:zh-TW:official&prmd=imvns&ei=0jj7TrC1Aqv2mAWk1OAl&start=10&sa=N&biw=1280&bih=904

Searching and Replacing Replace multiple string pairs in one go (File: MultiReplace.py)
http://effbot.org/zone/python-replace.htm#multiple

Wu-Manber in Python
http://blog.kzfmix.com/entry/1197552786
Nikita's blog: Fuzzy string search
http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html
A brief discussion of geohash - K_Reverter - cnblogs
http://www.cnblogs.com/step1/archive/2009/04/22/1441689.html

Charming Python: Beat spam using hashcash
http://www.ibm.com/developerworks/linux/library/l-hashcash/index.html
Dspam Python Module
http://bmsi.com/python/dspam.html
Spam Filtering Techniques: -- Comparing a Half-Dozen Approaches to Eliminating Unwanted Email --
http://gnosis.cx/publish/programming/filtering-spam.html
Learning Spam and Ham
http://infocenter.guardiandigital.com/manuals/SecureMail/node80.html

Perl Hashes
http://www.misc-perl-info.com/perl-hashes.html#findoutph
Substring search algorithm
http://volnitsky.com/project/str_search/index.html
Python - Dictionary Data Type
http://www.tutorialspoint.com/python/python_dictionary.htm
Counter class « Python recipes « ActiveState Code
http://code.activestate.com/recipes/576611/
Converting a single ordered list in python to a dictionary, pythonically - Stack Overflow
http://stackoverflow.com/questions/1639772/converting-a-single-ordered-list-in-python-to-a-dictionary-pythonically
Code Like a Pythonista: Idiomatic Python
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
Python: List to Dictionary - Stack Overflow
http://stackoverflow.com/questions/4576115/python-list-to-dictionary
Looks Like It - The Hacker Factor Blog
http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
Practical: A Spam Filter
http://www.gigamonkeys.com/book/practical-a-spam-filter.html
Python Programming - Email Whitelist for Spam Filters - Building an Email Whitelist to Train Your Spam Filter
http://python.about.com/od/pythonstandardlibrary/ss/email_whitelist.htm
Python: Bad Words Filter - jeff00seattle
http://sites.google.com/site/jeff00seattle/Home/python-coding/python--bad-words-filter




Tuesday, December 27, 2011

Daily Bookmarks 20111227

Making Archlinux's pacman fly | Rest Valley
http://lihdd.net/2010/05/archlinux-pacman-accelerate/
Never use MongoDB? Really?! | CoolShell - CoolShell.cn
http://coolshell.cn/articles/5826.html

High Scalability - High Scalability - Product: Scribe - Facebook's Scalable Logging System
http://highscalability.com/product-scribe-facebooks-scalable-logging-system
benchmark.py - urllib3 - Python HTTP library with thread-safe connection pooling and file post support. - Google Project Hosting
http://code.google.com/p/urllib3/source/browse/test/benchmark.py

15.3. Zend_Service_Amazon
http://oss.org.cn/ossdocs/php/zend/ZendFramework-0.1.5/documentation/end-user/zh/zend.service.amazon.html
Introduction — PyTables 2.3.1 documentation
http://pytables.github.com/usersguide/introduction.html
Word segmentation algorithms in practice | shell's home
http://shell909090.com/blog/2008/10/%E5%88%86%E8%AF%8D%E7%AE%97%E6%B3%95%E7%9A%84%E5%85%B7%E4%BD%93%E5%AE%9E%E8%B7%B5/

Some Notes on Tim Bray's Wide Finder Benchmark
http://effbot.org/zone/wide-finder.htm
The in operator ((An Unofficial) Python Reference Wiki)
http://pyref.infogami.com/in
KingsoftPythoner - cpyug - Kingsoft is hiring Python talent year-round - CPyUG (Chinese Python User Group), collection/maintenance of mailing-list management notices - Google Project Hosting
http://code.google.com/p/cpyug/wiki/KingsoftPythoner#%E9%87%91%E5%B1%B1%E9%95%BF%E5%B9%B4%E6%8B%9B%E8%81%98Py%E4%BA%BA%E6%89%8D
python-segment - segmentation and classify library written by python - Google Project Hosting
http://code.google.com/p/python-segment/

路迢迢,人遥遥---focusing on C++, Python, food, life and philosophy django
http://www.lutiaotiao.com/main/tags/15/0/
Wide Finder good site
http://dalkescientific.com/writings/diary/archive/2007/10/07/wide_finder.html#dalke-wf-9
other Wide Finder implementations
http://www.dalkescientific.com/writings/diary/archive/2007/10/10/other_wide_finder_implementations.html
Go deh!: Wide Finder on the command line
http://paddy3118.blogspot.com/2007/10/wide-finder-on-command-line.html
ongoing by Tim Bray · The Wide Finder Project
http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder


Douban "show only the original poster", pyquery version - Code sharing - OSChina community
http://www.oschina.net/code/snippet_87626_7691
Python3筆摘: [pyquery] a fantastic tool for scraping web page data
http://nopython.blogspot.com/2011/11/pyquery.html
PyQuery Tutorial: Basic HTML Parsing with PyQuery | Vert Studios
http://www.vertstudios.com/blog/pyquery-tutorial-basic-html-parsing-pyquery/
Scraping data with pyquery-Tomda生活點滴
http://543.vipe.idv.tw/2011/06/pyquery.html

A challenge for future mathematicians: the NP problem
http://episte.math.ntu.edu.tw/articles/mm/mm_10_2_04/index.html




Friday, December 23, 2011

Daily Bookmarks 20111223

14.1. hashlib — Secure hashes and message digests — Python v2.7.2 documentation
http://docs.python.org/library/hashlib.html#module-hashlib
Python: removing duplicate elements from a sequence s
http://proupy.com/news/50
Consistent hashing implemented simply in Python - amix.dk
http://amix.dk/blog/post/19367
Source Checkout - hashdb - Library and Application for building database/s of file hash values - Google Project Hosting
http://code.google.com/p/hashdb/source/checkout
pda/flexihash - GitHub
https://github.com/pda/flexihash#readme
Entity Crisis: Consistent Hashing in Python
http://entitycrisis.blogspot.com/2010/05/consistent-hashing-in-python.html
memcached's distributed algorithm: Consistent Hashing « 排头兵 @ Talk
http://www.paitoubing.cn/blog/memcached_consistent_hashing
consistent-hashing in python — Gist
https://gist.github.com/1341846
The consistent hashing algorithm - consistent hashing - sparkliang's column - Blog Channel - CSDN.NET
http://blog.csdn.net/sparkliang/article/details/5279393
A thorough analysis of memcached – 4. memcached's distributed algorithm - idv2
http://tech.idv2.com/2008/07/24/memcached-004/
solr sint field hashcode collisions as high as 99%, causing a solr-memcached bug - Bory.Chan
http://blog.chenlb.com/2009/06/solr-sint-hashcode-conflict-cause-solr-memcached-bug.html



Thursday, December 22, 2011

Daily Bookmarks 20111222

Clustering text in Python - Stack Overflow
http://stackoverflow.com/questions/1789254/clustering-text-in-python
A three-minute introduction to Cython - 赖勇浩的编程私伙局 - Blog Channel - CSDN.NET
http://blog.csdn.net/lanphaday/article/details/4561611
Speed up your Python: Unladen vs. Shedskin vs. PyPy vs. Cython vs. C « Geet Duggal
http://geetduggal.wordpress.com/2010/11/25/speed-up-your-python-unladen-vs-shedskin-vs-pypy-vs-c/
Ian Bicking: a blog
http://blog.ianbicking.org/
CPython vs PyPy vs Cython
http://jaredforsyth.com/blog/2010/jul/21/cpython-vs-pypy-vs-cython/
solem's vision blog: Hierarchical Clustering in Python
http://www.janeriksolem.net/2009/04/hierarchical-clustering-in-python.html
Amazon: view all brands
http://www.amazon.cn/gp/search/other/ref=sv_cps_0?ie=UTF8&n=665002051&pickerToList=brandtextbin
A small book website generated by Python scripts
http://www.douban.com/group/topic/6043510/
Easily extracting links with Python
http://www.blogkid.net/archives/1827.html
python clustering - Google Search
https://www.google.com/search?q=python+clustering&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:zh-TW:official&client=firefox
Rebill's Blog » Changing Ubuntu ulimit limits
http://blog.rebill.info/archives/modify-ubuntu-ulimit-restrictions.html
洛阳铲的日志 : crawler collection
http://www.lyc.name/crawler-collection.html
Improving system performance with ulimit | 飘渺的风 | Personal reflections on life, study and work
http://www.huanxiangwu.com/631/improve-system-performance-by-ulimit
sitescraper - Web scraping made simple - Google Project Hosting
http://code.google.com/p/sitescraper/
Learning Python by writing a screen scraper | Composite
http://bookmaniac.org/learning-python-by-writing-a-screen-scraper/
Write a Screen Scraper with Python - Prodigy Productions, LLC - Learn software development, computer vision, and A.I. programming.
http://www.prodigyproductionsllc.com/articles/programming/write-a-screen-scraper-with-python/

程式旅人 - 學習紀事 -: List Comprehension in Python (列表綜合?)
http://nio127.blogspot.com/2008/10/pythonlist-comprehension.html


other
王元涛的Blog: [Notes] Douban campus recruiting talk
http://todwang.blogspot.com/2007/12/blog-post.html
My first year at Taobao: part one, life and work - JasonLee's column - Blog Channel - CSDN.NET
http://blog.csdn.net/jasonblog/article/details/7026193
Notes on the first PyCon China - JasonLee's column - Blog Channel - CSDN.NET
http://blog.csdn.net/jasonblog/article/details/7040420

What on earth is "Pythonic"? - 赖勇浩的编程私伙局 - Blog Channel - CSDN.NET
http://blog.csdn.net/lanphaday/article/details/2762251
Python en:More - Notes
http://www.swaroopch.com/notes/Python_en:More#List_Comprehension
王元涛的Blog: Some statistical properties of the Netflix data
http://todwang.blogspot.com/2008/11/netflix-data.html
王元涛的Blog: Second-hand Hash
http://todwang.blogspot.com/2009/02/second-hand-hash.html
xlvector – Recommender System - The effectiveness of recommender systems: what percentage is it really at Amazon
http://xlvector.net/blog/?p=802

How to use curl_multi() without blocking
http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/
蚊子館: Varnish – Installation
http://linux-guys.blogspot.com/2011/01/varnish.html
amazon robots.txt
http://www.amazon.com/robots.txt
Amazon ASIN listing and similarity graph : Free Download & Streaming : Internet Archive
http://www.archive.org/details/amazon_similarity_graph/
Internet Archive: Details: Amazon ASIN listing and similarity graph data source | Infochimps
http://www.infochimps.com/sources/internet-archive-details-amazon-asin-listing-and-similarity-grap
Infogami Dev Site (Infogami)
http://infogami.org/
(theinfo)
http://theinfo.org/
Aaron Swartz
http://www.aaronsw.com/
Introduction to Xapian and how to use it » OpenSalon
http://www.opensalon.org/blog/2011/05/xapian-intro
Learning Xapian 1--building the inverted-index data(base), applying the factory pattern - goldenlock - cnblogs
http://www.cnblogs.com/rocketfan/archive/2010/08/09/1796054.html


Wednesday, December 21, 2011

Daily Bookmarks 20111221

Multi-level scraping of product data from 卓越网 (Joyo) | GooSeeker
http://www.gooseeker.com/cn/node/document/metaseeker/cookbookv4/multilayers.html
Write Your Own Web Crawler (with 1 CD-ROM) / 罗刚 - Books - 卓越亚马逊 (Joyo Amazon) [search engine]
http://www.amazon.cn/%E8%87%AA%E5%B7%B1%E5%8A%A8%E6%89%8B%E5%86%99%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB-%E7%BD%97%E5%88%9A/dp/product-description/B0047T6B4O/ref=dp_proddesc_0?ie=UTF8&s=books
First, here is one of my crawlers: an Amazon price tracker - 未名空间 (mitbbs.com)
http://www.mitbbs.com/clubarticle_t/Net_Parser/29458563.html
Review of outstanding cases--building a product review site with Mashup and crawler techniques - 软酷快讯
http://express.ruanko.com/ruanko-express_15/webpage/tech2.html
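A Python crawler for scraping information from Amazon, Dangdang and Douban. - Husw!OnRoad 在路上 - 我就是个世界's blog--------Life is like a journey, so let the soul travel~ I am always on the road...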
http://www.husw.net/blog/post/1033/
A Python crawler for scraping information from Amazon, Dangdang and Douban. - Husw!OnRoad 在路上 - 我就是个世界 WyattWang's 麦库
http://note.sdo.com/u/1445822006/n/mbnUS~jpc8wwLX028001cZ
Easily extracting links with Python
http://www.blogkid.net/archives/1827.html
程式扎記: [ Java Crawler ] Distributed crawler: distributed storage
http://puremonkey2010.blogspot.com/2011/12/java-crawler.html
Getting Amazon product IDs from Baidu with Python - Code sharing - OSChina community
http://www.oschina.net/code/snippet_220262_7783
xlvector – Recommender System - Python crawler
http://xlvector.net/blog/?p=83








Monday, December 19, 2011

Daily Bookmarks 20111219

Scraping Amazon data in Java (including Signature generation) | I'm Donkey
http://imdonkey.com/blog/archives/60

Princess Polymath » Blog Archive » Accessing Amazon’s Product Advertising API with Python
http://www.princesspolymath.com/princess_polymath/?p=182

python-amazon-product-api 0.2.5 : Python Package Index
http://pypi.python.org/pypi/python-amazon-product-api/

Basic usage — python-amazon-product-api v0.2.5 documentation
http://packages.python.org/python-amazon-product-api/basic-usage.html
Product Advertising API
http://docs.amazonwebservices.com/AWSECommerceService/2009-11-01/DG/index.html?BrowseNodeIDs.html
Hack 8 Browse and Search Categories with Browse Nodes :: Chapter 1. Browsing and Searching :: Amazon hacks. Tips and tools :: Misc :: eTutorials.org
http://etutorials.org/Misc/amazon+tips+tools/Chapter+1.+Browsing+and+Searching/Hack+8+Browse+and+Search+Categories+with+Browse+Nodes/
About shishijia.com | 网络购物时时价
http://www.trachina.com/diary/%E6%97%B6%E6%97%B6%E4%BB%B7%E9%A1%B9%E7%9B%AE/shishijia
Working With the "One-Second" Rule
http://www.a2sdeveloper.com/page-working-with-the-one-second-rule.html
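时时价 - a price-comparison site with an authenticity guarantee, offering the latest product price comparisons, deals and discount promotions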
http://shishijia.com/

Python Cookbook-1.23. Encoding Unicode data for output to XML or HTML files | Shine.IT
http://blog.shine-it.net/python/encoding-unicode-data-for-xml-and-html
killer python projects: Making The New Amazon Product API Easy to Work With
http://webcache.googleusercontent.com/search?q=cache:EuB9wvZUaAYJ:pythonprojectwatch.blogspot.com/2011/12/making-new-amazon-product-api-easy-to.html+&cd=3&hl=zh-TW&ct=clnk&client=firefox
Google App Engine + Amazon Product Advertising API | Rutwick Gangurde's Blog
http://blog.rutwick.com/set-up-an-amazon-book-store-on-google-app-engine
How to make a music mashup – Muskblog
http://blog.muschamp.ca/2010/12/31/how-to-make-a-music-mashup/
Querying Amazon ECS with Boto | SysAdminPy
http://www.sysadminpy.com/2011/02/querying-amazon-ecs-with-boto/






Friday, December 16, 2011

Daily Bookmarks 20111216

Steps for compiling and installing Xapian and its Python binding | 弱类型
http://troycheng.blogcn.com/articles/xapian%E7%BC%96%E8%AF%91%E5%AE%89%E8%A3%85%E5%8F%8Apython-binding%E7%9A%84%E6%AD%A5%E9%AA%A4.html
How is Xapian pronounced? | BT的花
http://www.dup2.org/node/1422
World Hello - term handling in Xapian indexes
http://www.worldhello.net/2010/07/31/1613.html#more-1613
Building an index with Xapian (Python version) - System architecture - python.cn(news, jobs)
http://simple-is-better.com/news/619
A simple PHP implementation of Search Engine Friendly URLs – 某人的栖息地
http://www.ooso.net/archives/174
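Building your own search engine with Xapian: retrieval | 新鲜事 | Following open source, the Internet, games, technology, startups, cloud computing, architecture, mobile, life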
http://www.liulizhi.info/xapian%e6%9e%84%e5%bb%ba%e8%87%aa%e5%b7%b1%e7%9a%84%e6%90%9c%e7%b4%a2%e5%bc%95%e6%93%8e%ef%bc%9a%e6%a3%80%e7%b4%a2/
Xapian ( Python ): a simple explanation of TermGenerator with usage examples - pyman hall - Blog Channel - CSDN.NET Good example
http://blog.csdn.net/zlchina1989/article/details/6777150
Learning Xapian (4) – Faceting Search (Filter) – 四号程序员
http://www.coder4.com/archives/2253
Learning Xapian (1) – basic indexing and searching – 四号程序员
http://www.coder4.com/archives/2218
Xapian | Search Results | Gea-Suan Lin's BLOG
http://blog.gslin.org/?s=Xapian
Building Fulltext Search for Pixnet | Gea-Suan Lin's BLOG
http://blog.gslin.org/archives/2007/06/15/1202/
A few details about Xapian | Gea-Suan Lin's BLOG
http://blog.gslin.org/archives/2007/08/10/1264/xapian-%e7%9a%84%e5%b9%be%e5%80%8b%e7%b4%b0%e7%af%80/
A Comparison of Open Source Search Engines | Vik's Blog
http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
Pumping Up Your Applications with Xapian Full-Text Search | Nadav Samet's Blog
http://www.thesamet.com/blog/2007/02/04/pumping-up-your-applications-with-xapian-full-text-search/
Tinkering with Xapian, part 3 - twelfthing - cnblogs
http://www.cnblogs.com/twelfthing/articles/1916112.html


Optimizing python web.py with flup and lighttpd – Tim[后端技术]
http://timyang.net/python/python-webpy-lighttpd/







Wednesday, December 14, 2011

Daily Bookmarks 20111214

Performance comparison of Python string matching tools | 弱类型
http://troycheng.blogcn.com/articles/python%e5%ad%97%e7%ac%a6%e4%b8%b2%e5%8c%b9%e9%85%8d%e5%b7%a5%e5%85%b7%e6%80%a7%e8%83%bd%e6%af%94%e8%be%83.html
py-contentfilter: a sensitive-word filtering service | 弱类型
http://troycheng.blogcn.com/articles/py-contentfilter%EF%BC%9A%E6%95%8F%E6%84%9F%E8%AF%8D%E8%BF%87%E6%BB%A4%E6%9C%8D%E5%8A%A1.html

Facebook’s photo storage rewrite | Niall Kennedy
http://www.niallkennedy.com/blog/2009/04/facebook-haystack.html
MIT World » : The Akamai Story: From Theory to Practice
http://www.myoops.org/twocw/mitworld/video/199/index.htm
Gentoo system update: fixing mouse and keyboard failure with xorg-server 1.6.1.901-r4 | Focused on Linux security
http://webcache.googleusercontent.com/search?q=cache:Pm3IWju4m04J:www.xslife.net/%3Fp%3D60+&cd=1&hl=zh-TW&ct=clnk&client=firefox
Back to Gentoo after two years (1) -- anyLinux
https://anylinux.net/post/1617.html
Linux study notes (366): mouse and keyboard failure after upgrading Gentoo to Xorg 1.10 | 王不日天
http://huanhaoadam.wordpress.com/2011/09/08/linux%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0%EF%BC%88%E4%B8%89%E7%99%BE%E5%85%AD%E5%8D%81%E5%85%AD%EF%BC%89%E2%80%94%E2%80%94gentoo%E5%8D%87%E7%BA%A7%E8%87%B3xorg-1-10%E5%90%8E%E9%BC%A0%E6%A0%87%E5%92%8C/
Linux study notes (365): using pointers to reduce the space overhead of C programs | 王不日天
http://huanhaoadam.wordpress.com/2011/07/06/linux%e5%ad%a6%e4%b9%a0%e7%ac%94%e8%ae%b0%ef%bc%88%e4%b8%89%e7%99%be%e5%85%ad%e5%8d%81%e4%ba%94%ef%bc%89%e2%80%94%e2%80%94%e5%88%a9%e7%94%a8%e6%8c%87%e9%92%88%e5%87%8f%e5%b0%91c%e7%a8%8b%e5%ba%8f/
Gentoo system update: fixing mouse and keyboard failure with xorg-server 1.6.1.901-r4 | Focused on Linux security
http://www.xslife.net/?p=60

Tuesday, December 13, 2011

Daily Bookmarks 20111213

Implementing a REST service with CRUD functionality in Python – Tim[后端技术]
http://timyang.net/python/python-rest/
Some problems with consistent hashing in practice in a distributed application – Tim[后端技术]
http://timyang.net/architecture/consistent-hashing-practice/
Understanding Python's naming mechanism - 赖勇浩的编程私伙局 - Blog Channel - CSDN.NET
http://blog.csdn.net/lanphaday/article/details/1734990
A simple approach to extracting the main text from HTML files - 赖勇浩的编程私伙局 - Blog Channel - CSDN.NET
http://blog.csdn.net/lanphaday/article/details/1741185
Three Python books worth buying - 赖勇浩的编程私伙局 - Blog Channel - CSDN.NET
http://blog.csdn.net/lanphaday/article/details/6204639

Sunday, December 11, 2011

Daily Bookmarks 20111211

Notes on my recent study of keyword extraction - caohao2008's column - Blog Channel - CSDN.NET
http://blog.csdn.net/caohao2008/article/details/3144639
distance.py - nltk - Natural Language Toolkit Development - Google Project Hosting
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/metrics/distance.py
The Art of Programming for Programmers: chapter 3 continued, implementing the Top K algorithm problem - 结构之法 算法之道 - Blog Channel - CSDN.NET good site
http://blog.csdn.net/v_JULY_v/article/details/6403777
aMMAI
http://chiehchi.blogspot.com/
Finding the top K numbers - 就像以往 - 51CTO tech blog
http://dongdong1314.blog.51cto.com/389953/366991
zz: finding the longest repeated substring in a text – Programming Pearls (an application of sorted suffix arrays) | Bruce is coding !
https://www.cse.msu.edu/~liyang5/?p=53
Counting word occurrences--hash table, binary tree, standard library - - Blog Channel - CSDN.NET
http://blog.csdn.net/lalor/article/details/7001357
Ten massive-data processing interview questions and a summary of ten methods - 结构之法 算法之道 - Blog Channel - CSDN.NET
http://blog.csdn.net/v_JULY_v/article/details/6279498
Profanity filtering revisited (a hash-based optimized algorithm) - 边城浪 - cnblogs
http://www.cnblogs.com/yeerh/archive/2011/10/20/2219035.html
Improved again! A .NET profanity filtering algorithm - xingd - cnblogs
http://www.cnblogs.com/xingd/archive/2008/02/01/1061800.html
An efficient keyword filtering and lookup algorithm (Trie KO Hash) - 边城浪 - cnblogs
http://www.cnblogs.com/yeerh/archive/2011/08/24/2152607.html
Notes on real-time computation over massive data | Search engine technology blog
http://flychen.com/article/massive-data-real-time-computation-essays.html
Duplicate detection in search engines | In Programming We Trust
http://ptsolmyr.com/2010/08/13/duplicate_detection/
刘未鹏's new book 《暗时间》 (complete)
http://www.douban.com/group/topic/20932914/
Storm: Twitter's real-time data processing tool - Forum reading
http://www.starming.com/index.php?action=plugin&v=wave&mid=34483&tid=15965



Friday, December 09, 2011

Daily Bookmarks 20111209

Approximate string matching - Wikipedia, the free encyclopedia
http://en.wikipedia.org/wiki/Approximate_string_matching




Assignment 17: Mutual Information - yam天空部落
http://blog.yam.com/Wfin/article/15631028
pealco/python-mutual-information - GitHub
https://github.com/pealco/python-mutual-information
Applications of Mutual Information - 好工具 webmaster sharing platform
http://www.haogongju.net/art/892435

Statistical natural language processing---foundations of information theory - zhoubl668's column: 远帆,梦之帆! - Blog Channel - CSDN.NET
http://blog.csdn.net/zhoubl668/article/details/6923763
Mutual information - Wikipedia, the free encyclopedia
http://zh.wikipedia.org/wiki/%E4%BA%92%E4%BF%A1%E6%81%AF
Academia Sinica Tagged Corpus of Early Mandarin Chinese
http://db1x.sinica.edu.tw/kiwi/pkiwi/early_mandarin_chinese_c_againhelp.html



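New-word discovery based on data mining: Approach for Lexicon Updating Based on ...
d.wanfangdata.com.cn › ... › 计算机应用研究, 2006, issue 12
By 王立希 - 2006 - cited 5 times
Uses text-mining techniques to propose a way of discovering new specialized vocabulary for the specialized dictionary of a topic-oriented search engine ... P Word Association Norms,Mutual Information and Lexicography 1990(01) ...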

Tuesday, December 06, 2011

Daily Bookmarks 20111206

Two ways to count duplicate text lines | 我爱正则表达式
http://iregex.org/blog/get-duplicated-lines.html
Counting Chinese characters / English words | 我爱正则表达式
http://iregex.org/blog/words-counter-in-python.html
Miscellaneous thoughts on anti-spam | 我爱正则表达式
http://iregex.org/blog/anti-spam.html




Saturday, December 03, 2011

Daily Bookmarks 20111202

无穷循环 - endless-loops.com » Algorithm analysis in the Python source code: the string matching algorithm
http://www.endless-loops.com/2011/02/python%E6%BA%90%E7%A0%81%E4%B8%AD%E7%9A%84%E7%AE%97%E6%B3%95%E5%88%86%E6%9E%90-%E4%B9%8B-%E5%AD%97%E7%AC%A6%E4%B8%B2%E5%8C%B9%E9%85%8D%E7%AE%97%E6%B3%95-452.html
Found a typo in the Python source code concerning the string fastsearch algorithm! - 学无止境 - Blog Channel - CSDN.NET
http://blog.csdn.net/studying/article/details/1426142



Daily Bookmarks 20111203

Comparing the speed of bsddb and zodb on the desktop -- anyLinux
https://anylinux.net/post/542.html


Thursday, December 01, 2011

Daily Bookmarks 20111201

The fastest way to sort a dictionary in Python - 张沈鹏 - 知识梳理
http://zuroc.42qu.com/10037539
descriptor:Python Idiom: sort - 樂多日誌
http://blog.roodo.com/descriptor/archives/7883223.html

CiteSeerX — Discovering Identities in Web Contexts with Unsupervised Clustering
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.67.9602
/tags/release-0.1/src/colint/ch03/generatefeedvector.py – Colint
http://csrc.tamu-commerce.edu/projs/colint/browser/tags/release-0.1/src/colint/ch03/generatefeedvector.py

Paul Graham | 36氪
http://www.36kr.com/tag/paul-graham/page/3
Finding duplicate files using Python « Endlessly Curious
http://www.endlesslycurious.com/2011/06/01/finding-duplicate-files-using-python/
Get the MD5 Hash of a String - Python - Source Code | DreamInCode.net
http://www.dreamincode.net/code/snippet1851.htm


Automatic Keyword Extraction
https://docs.google.com/viewer?a=v&q=cache:qwsqMHh78I8J:www.l3s.de/~demidova/students/thesis_oelze.pdf+term+frequency+extract+keyword+product&hl=zh-TW&gl=tw&pid=bl&srcid=ADGEESjBgTvNmtNRiGNH4xTrPzw3ZryPhvbQ0MqutWg80zD0eSSOTw4517JkwyvkjivxzD1BkkxBJ4onqQ8YiCp9S_y89rukOkJAAh4AMLmb6fqZpw5AQOaHUeEadIw_XMFyvdq3MdYl&sig=AHIEtbS8553nR9Q9wG7EDtUtj4RBnM0ljg

hash table
11. A thorough, start-to-finish analysis of hash table algorithms - 结构之法 算法之道 - Blog Channel - CSDN.NET Good
http://blog.csdn.net/v_july_v/article/details/6256463
Building the fastest hash table (repost) - 我风 - C++ Blog
http://www.cppblog.com/kyelin/archive/2007/08/21/30506.aspx
Alex's Blog : Building the fastest hash table (a conversation with Blizzard)
http://blog.itpub.net/post/670/16449
uthash: a hash table for C structures
http://uthash.sourceforge.net/
How hash tables work - 未知世界 - ITeye technical site
http://calmness.iteye.com/blog/184465

How Entity Extraction is Fueling the Semantic Web Fire - O'Reilly Broadcast
http://broadcast.oreilly.com/2009/02/how-entity-extraction-is-fueli.html
[totti's blog] Naming is a troublesome thing
http://webcache.googleusercontent.com/search?q=cache:bNykOxSvuxEJ:totti-yang.blogspot.com/+Named+Entity&cd=3&hl=zh-TW&ct=clnk&gl=tw&client=firefox-a
Named-Entity Recognition product - Google Search
http://www.google.com.tw/search?q=Named-Entity+Recognition+product&hl=zh-TW&client=firefox-a&hs=EBV&rls=org.mozilla:zh-TW:official&prmd=imvns&ei=tKrXTs3qLOGRiQfpnOX-DQ&start=10&sa=N&biw=1235&bih=649
Named Entity Extraction
http://infoglutton.com/yooname-named-entity-recognition.html
Named entity recognition with preset list of names for Python / PHP - Stack Overflow
http://stackoverflow.com/questions/4206882/named-entity-recognition-with-preset-list-of-names-for-python-php
Named-entity recognition - Wikipedia, the free encyclopedia
http://en.wikipedia.org/wiki/Named-entity_recognition

http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html









Wednesday, November 30, 2011

Daily Bookmarks 20111130

Using the uniq command to compute the intersection, union and difference of multiple text files_骚人默客_Sina Blog
http://blog.sina.com.cn/s/blog_5133d4dd0100lemw.html

Is there any Python package that can measure text similarity?
http://www.douban.com/group/topic/5712159/

Indexing and searching
ir.hit.edu.cn/phpwebsite/index.php?module...

Larbin[1] hashtable checker source code analysis  quweiprotoss's journal  NetEase Blog vert Good site
http://quweiprotoss.blog.163.com/blog/static/4088288320103190243558/
From Larbin to massive data processing_sunshinesandy_Baidu Space
http://hi.baidu.com/sunshinesandy/blog/item/4aab0e0e0dc43e2ce82488c7.html
Massive data » 码农 | Following the Internet, algorithms, development
http://blog.redfox66.com/post/category/search-tech/massdata
Web crawlers--larbin - to myself's categorized study log - C++ Blog
http://www.cppblog.com/toMyself/archive/2010/08/28/125073.aspx


7H2O | 汽水森林
http://www.7h2o.com/category/python/

smallseg - DFA Based Chinese Word Segmentation Library of Python and Java - Google Project Hosting
http://code.google.com/p/smallseg/

A simple implementation of Chinese word segmentation in Python - FreeDoDo.com
http://www.freedodo.com/2011/03/28/%E7%94%A8python%E7%AE%80%E5%8D%95%E5%AE%9E%E7%8E%B0%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D.html

Automatic tag extraction with TF-IDF - 快乐学习 - 不烦恼的博客
http://bufannao.com/archives/TF-IDF.html

Web content recommendation system | 信研所::a sharing community for management information systems and related disciplines
http://www.misins.org/wcrs

TF-IDF algorithm experiment - MrYang's Blog - BlogBus
http://mryang.blogbus.com/logs/45675845.html
A casual look at text analysis: keyword extraction in classifiers « UGC广播站
http://ugc.renren.com/2010/02/01/keywords-extraction-overview/

Automatic Keyword Extraction - Homepage of Cheng-Zhi Zhang
https://sites.google.com/site/zhangczhomepage/keyword-extraction




Sunday, November 27, 2011

Daily Bookmarks 20111127

A fast method for URL deduplication
http://www.360doc.com/content/08/1031/15/3500_1855560.shtml
~/.trash » bloom filter notes (1)
http://grepk.com/?p=605
BloomFilter: a powerful tool for large-scale data processing (solving the empty-lookup problem) | dbafree home
http://www.dbafree.net/?p=36
Studying the Bloom Filter algorithm commonly used for URL deduplication in web spiders… | 互联网,请记住我
http://www.162cm.com/archives/783.html
Web crawler design: designing a URL deduplication store_守护地下铁_Baidu Space
http://hi.baidu.com/shirdrn/blog/item/40ed0fb1ceac4d5c0923029d.html

URL deduplication is not simple - 智障大师's column - Blog Channel - CSDN.NET
http://blog.csdn.net/historyasamirror/article/details/6746217
Notes on NoSQL databases
http://sebug.net/paper/databases/nosql/Nosql.html#_8314717379700977_930601348298
Blog of the Oracle Berkeley DB China R&D team » embedded
http://www.bdbchina.com/tag/embedded/
The problem of deduplicating large numbers of URLs - jollyjumper's column - Blog Channel - CSDN.NET
http://blog.csdn.net/jollyjumper/article/details/6415723
程式扎記: [ Java Crawler ] Designing a crawler queue
http://puremonkey2010.blogspot.com/2011/10/java-crawler_20.html
Bloom Filter « Python recipes « ActiveState Code (22 lines, good site)
http://code.activestate.com/recipes/577684-bloom-filter/
Bloom filters and a simple spell checker in Python
http://lists.canonical.org/pipermail/kragen-hacks/2006-August/000431.html

Analysis of duplicate web page detection techniques in search engines (continued) - 我的BT下载实验室 - ITeye tech site (nice site title)
http://wangdei.iteye.com/blog/376721

Coding Horror: URL Shortening: Hashes In Practice
http://www.codinghorror.com/blog/2007/08/url-shortening-hashes-in-practice.html

Crawling the Web
dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf

Applying the Bloom algorithm to URL deduplication_菩提树_Sina Blog (very nice site)
http://blog.sina.com.cn/s/blog_7165756b0100odeu.html
Applying K-value clustering to items in the same price range_菩提树_Sina Blog
http://blog.sina.com.cn/s/blog_7165756b0100odij.html
Applying Bloom filters to web page deduplication - 泪下的天空 - My Sohu (nice site; C++ implementation, worth reading)
http://jinyun2012.blog.sohu.com/163477317.html
URL deduplication: the Bloom Filter algorithm, its error rate, and more - 我要去桂林: 田春峰's IT weblog - IT改进生活
http://blog.donews.com/accesine/archive/2007/01/23/1118640.aspx
URL deduplication in larbin: the Bloom Filter algorithm - piziwang - ITeye tech site
http://piziwang.iteye.com/blog/740394
Larbin: Crawling the web, that is my passion
http://larbin.sourceforge.net/index-eng.html

THE VERY SIMPLE HASH TABLE EXAMPLE (Java, C++) | Algorithms and Data Structures
http://www.algolist.net/Data_structures/Hash_table/Simple_example
Python dictionary implementation | Laurent Luce's Blog
http://www.laurentluce.com/posts/python-dictionary-implementation/

Notes from reading the nutch source code - CookStar - cnblogs
http://www.cnblogs.com/clarkchen/archive/2011/02/22/1960892.html





Thursday, November 24, 2011

Daily Bookmarks 20111124

Ten massive-data-processing interview questions and a summary of ten methods - 结构之法 算法之道 - Blog Channel - CSDN.NET
http://blog.csdn.net/v_JULY_v/article/details/6279498
Jonathan Ellis's Programming Blog - Spyced: All you ever wanted to know about writing bloom filters
http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
11. A thorough, start-to-finish analysis of hash table algorithms - 结构之法 算法之道 - Blog Channel - CSDN.NET
http://blog.csdn.net/v_JULY_v/article/details/6256463



Wednesday, November 23, 2011

Daily Bookmarks 20111123

Bloom Filter « dev.poga.tw
http://devpoga.wordpress.com/2010/01/07/bloom-filter/
Bloom filter - Using hashes to test whether an element exists in a set @ 第二十四個夏天後 :: 痞客邦 PIXNET ::
http://changyy.pixnet.net/blog/post/22946355-bloom-filter---%E4%BD%BF%E7%94%A8-hash-%E4%BE%86%E5%88%A4%E6%96%B7%E5%85%83%E7%B4%A0%E6%98%AF%E5%90%A6%E5%AD%98%E5%9C%A8%E6%96%BC%E4%B8%80
Sigma » Hash and Bloom Filter
http://www.sigma.me/2011/09/13/hash-and-bloom-filter.html
Massive data processing: Bloom Filter explained in detail - 结构之法 算法之道 - Blog Channel - CSDN.NET
http://blog.csdn.net/v_july_v/article/details/6685894
Implementing a URL filter based on the Bloom filter algorithm - 黑色的菠菜 - ITeye tech site
http://lywybo.iteye.com/blog/787337
Nothing: Bloom Filter (nice site)
http://wnicholas.blogspot.com/2010/04/bloom-filter.html
Bloom Filter (Python version)_学思之家_Baidu Space (good site)
http://hi.baidu.com/ruclin/blog/item/6eceed50e2124715367abe94.html/cmtid/7c5edad70066212006088b64
Hash lookups, Database lookups, and Scalability
http://www.perlmonks.org/?node_id=404105

Bloom Filter Resources | BitWorking | Joe Gregorio
http://bitworking.org/news/380/bloom-filter-resources
Title: A summary of methods for processing large and massive data sets
http://bbs.xjtu.edu.cn/BMYAJBDVQSTVHSJUADPOGJEVMYLABIFCXFQP_B/con?B=Algorithm&F=M.1259224358.A&N=3682&T=0
jython - Modern, high performance bloom filter in Python? - Stack Overflow
http://stackoverflow.com/questions/311202/modern-high-performance-bloom-filter-in-python
URL hashing in search engines - Linux环境高级编程 - Blog Channel - CSDN.NET
http://blog.csdn.net/21aspnet/article/details/6596746
Web crawler design: designing a URL deduplication store_守护地下铁_Baidu Space
http://hi.baidu.com/shirdrn/blog/item/40ed0fb1ceac4d5c0923029d.html
URL hash architecture for image servers - Linux环境高级编程 - Blog Channel - CSDN.NET
http://blog.csdn.net/21aspnet/article/details/6596745
Saving memory: Instagram's Redis practices - NoSQLFan - Following NoSQL technology and news
http://blog.nosqlfan.com/html/3379.html

Tuesday, November 22, 2011

Sunday, November 20, 2011

Daily Bookmarks 20111120

Magento: exporting product data to submit to Google Products, exporting XML with code | mivec
http://mivec.ixsbn.com/archives/magento-export-product-and-submit-to-google-product-part-ii
one tip for title optimisation|Data Feed|每天进步一点点
http://holabell.com/googleseo/data-feed/one-tip-for-title-optimisation
Google Product Search results now show YouTube review videos | 谷奥: exploring the mysteries of Google
http://www.guao.hk/posts/youtube-video-reviews-now-part-of-google-product-search.html
Google seeks to improve Product Search quality through new merchant rules | 谷飯
http://goofan.net/6578/google-sought-new-business-rules-to-improve-search-quality-product-search.html
Is Google learning from Baidu and pushing its shopping search? - KPN keyword SEO service
http://kpnweb.com/news_content.asp?n_id=114
Google Product Search gets a full upgrade for the Christmas holiday shopping rush | 谷奥: exploring the mysteries of Google
http://www.guao.hk/posts/google-product-search-updates-for-holiday-seasons.html#more-13767
Product feed specification - Google Merchant Center Help
http://www.google.com/support/merchants/bin/answer.py?hl=zh-Hans&answer=188494#other
Baidu product search FAQ - Web Search Help
http://open.baidu.com/coop/productQuestion.html#01
Baidu Open Platform_Open category details
http://open.baidu.com/cms/static/category.html?name=goods#category_guide_8
How to generate an XML file for the Baidu Open Platform_Baidu Jingyan
http://jingyan.baidu.com/article/6079ad0e49c6d228ff86dbd5.html

http://www.google.com/basepages/producttype/taxonomy.en-US.txt
PChome EC search engine: find products across sites!
http://briian.com/?p=6119
Fun3C - The most convenient and accurate price comparison site for 3C products
http://fun3c.lingtelli.com/
Digital Wall: Openfind is first to launch Taiwan's first product search service
http://www.digitalwall.com/scripts/displaypr.asp?UID=1641

Wednesday, November 16, 2011

Monday, November 07, 2011

Daily Bookmarks 20111107

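A dictionary-based forward maximum matching Chinese word segmentation algorithm that handles mixed Chinese, English, and numeric text - lucene + hadoop distributed parallel computing search framework - BlogJava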
http://www.blogjava.net/nianzai/archive/2011/08/04/355786.html
Chinese word segmentation for a full-text content recommendation engine - 51CTO.COM
http://database.51cto.com/art/201108/284085.htm
台扣啵's research journal: [PHP] Implementing Chinese word segmentation for Zend_Search_Lucene - 樂多日誌
http://blog.roodo.com/taikobo0/archives/6027073.html
Introduction to Chinese word segmentation: the maximum matching method | 我爱自然语言处理
http://www.52nlp.cn/maximum-matching-method-of-chinese-word-segmentation
MIT Natural Language Processing, Lecture 2: Word counting (Part 4) | 我爱自然语言处理
http://www.52nlp.cn/mit-nlp-second-lesson-word-counting-fourth-part
Search & advertising « 自娱自乐
http://xjzhou.wordpress.com/category/machinelearning/%E6%90%9C%E7%B4%A2%E5%B9%BF%E5%91%8A/
Google 黑板报 - Google China's blog, a closer look at our products, technology, and culture: The Beauty of Mathematics, Part 2 - On Chinese word segmentation
http://www.google.com.hk/ggblog/googlechinablog/2006/04/blog-post_2507.html
The impressive Chinese word segmentation in Google Chrome - Miscellaneous - python.cn(news, jobs)
http://simple-is-better.com/news/319
Code sharing - Computing text similarity with the cascade method (algorithms/data structures) -by TY -pythoner.net
http://pythoner.net/code/31/

A simple script to convert multi-line srt subtitles to single lines [Python] | Felix's Blog
http://blog.felixc.at/2010/07/srt-multiline-convert-python/
Python: Converting plain text to PNG | Felix's Blog
http://blog.felixc.at/2011/05/python-text-to-png/
Python: usage frequency of emoticons on Sina Weibo - L Cooper - cnblogs
http://www.cnblogs.com/Lannik/archive/2011/10/21/2219776.html

Sunday, November 06, 2011

Daily Bookmarks 20111105

Implementing Ajax pagination with JSON data « 老韩
http://www.handaoliang.com/article_94.html
A practical search and pagination effect built with jQuery, JSON, and Ajax_HTML5, JS code, Div+CSS, CSS3, cool site showcase - 网页前端吧 - Work happily, live joyfully
http://www.jscss8.com/jsad/qitadaima/20100729_6258.html
How to use jQuery to paginate JSON data? - Stack Overflow
http://stackoverflow.com/questions/2507844/how-to-use-jquery-to-paginate-json-data
The jQuery Pagination Ajax plugin explained in detail (in Chinese) « 张鑫旭-鑫空间-鑫生活
http://www.zhangxinxu.com/wordpress/2010/01/jquery-pagination-ajax%E5%88%86%E9%A1%B5%E6%8F%92%E4%BB%B6%E4%B8%AD%E6%96%87%E8%AF%A6%E8%A7%A3/
Pagination | jQuery Plugins
http://plugins.jquery.com/project/pagination
jQuery plugin: Tablesorter 2.0 - Pager plugin
http://tablesorter.com/docs/example-pager.html
Making a jQuery pagination system | web enavu
http://web.enavu.com/tutorials/making-a-jquery-pagination-system/
Lighty RoR: Pagination: making pagination no longer tedious
http://lightyror.thegiive.net/2006/11/pagination.html
How to Paginate Data with PHP | Nettuts+
http://net.tutsplus.com/tutorials/php/how-to-paginate-data-with-php/



A brief analysis of P->NP->NP-complete-NP-hard problems - 张林林|深蓝 (Linlin Zhang, shenlan211314)'s column - Blog Channel - CSDN.NET
http://blog.csdn.net/shenlan211314/article/details/6232472
~Paganini Amadeus' Notebook~: P vs NP vs NP-Hard vs NP-Complete - yam天空部落
http://blog.yam.com/hn12303158/article/19947114


Infinite Loop: [Notes] Building a Chrome Extension to download YouTube videos
http://program-lover.blogspot.com/2011/08/youtube-chrome-extension.html



Thursday, November 03, 2011

Daily Bookmarks 20111103

The stringlib Library
http://effbot.org/zone/stringlib.htm
python - improving Boyer-Moore string search - Stack Overflow
http://stackoverflow.com/questions/1106112/improving-boyer-moore-string-search
elastic search, another good Lucene-based NoSQL project | summersmile1984's personal site
http://summersmile1984.i-branding.me/2011/03/31/elastic-search%E5%8F%88%E4%B8%80%E4%B8%AA%E5%9F%BA%E4%BA%8Elucene%E7%9A%84nosql%E5%A5%BD%E9%A1%B9%E7%9B%AE/
[projects] Contents of /python/trunk/Objects/stringlib/fastsearch.h
http://svn.python.org/view/python/trunk/Objects/stringlib/fastsearch.h?revision=77470&view=markup
Lucid Imagination » Exploring Lucene’s Indexing Code: Part 2
http://www.lucidimagination.com/blog/2009/03/18/exploring-lucenes-indexing-code-part-2/
Delve inside the Lucene indexing mechanism
http://www.ibm.com/developerworks/library/wa-lucene/

How to Index PDF Documents with Lucene | kalani's Tech blog
http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html
elasticsearch - - Open Source, Distributed, RESTful, Search Engine
http://www.elasticsearch.org/
Study notes 4.3 - Document filtering: Use Naive Bayes_土老冒_Baidu Space
http://hi.baidu.com/idontknow1987/blog/item/f36adcc5e5e87da48326ac4b.html
A brief introduction to installing and using PyLucene | 非鱼观点 - Internet Watch
http://www.unfish.net/archives/269-20080118.html
绚丽也尘埃 » PyLucene in Action
http://www.fuzhijie.me/?p=273
SourceForge.net: Benchmarks - clucene
http://sourceforge.net/apps/mediawiki/clucene/index.php?title=Benchmarks
Django and Lupy
http://www.rkblog.rk.edu.pl/w/p/django-lupy/

Xapian performance comparision with Whoosh « Searching with Xapian
http://xapian.wordpress.com/2009/02/12/xapian-performance-comparision-with-whoosh/

xapwrap - a PHP wrapper for calling xapian, with Chinese search support - Google Project Hosting
http://code.google.com/p/xapwrap/


Building an index with xapian (Python version) - System Architecture - python.cn(news, jobs)
http://simple-is-better.com/news/619


Stemming Algorithm - 荡气回肠,奔流不息 - tayoto - Hexun Blog
http://tayoto.blog.hexun.com/38957815_d.html


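Online demo | Chinese word segmentation | PHP Chinese word segmentation - A free, open-source, simple Chinese word segmentation system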
http://www.ftphp.com/scws/demo.php

About xunsearch - 迅搜 (xunsearch) - A free, open-source Chinese full-text search engine
http://www.xunsearch.com/about

Zongheng Search (纵横搜索)
http://discuz.qq.com/service/search

Chinese word segmentation « 神仙的仙居
http://xiezhenye.com/tag/%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D

Chinese word segmentation in Python: pure-Python implementation / FMM algorithm / pymmseg-cpp / smallseg / judou 句读 / BECer-GAE - Miscellaneous - python.cn(news, jobs)
http://simple-is-better.com/news/387






Wednesday, November 02, 2011

Daily Bookmarks 20111102

Delve inside the Lucene indexing mechanism (Chinese edition)
http://www.ibm.com/developerworks/cn/java/wa-lucene/
Lucene in practice, Part 1: Getting to know Lucene
http://www.ibm.com/developerworks/cn/java/j-lo-lucene1/index.html

python
Understanding decorators in Python | 代码回音
http://www.codecho.com/understanding-python-decorators/#more-131893
Binary search algorithm in Python (bisection) | 代码回音
http://www.codecho.com/python-binary-search-algorithm/
How to use *args and **kwargs in Python | 代码回音
http://www.codecho.com/how-to-use-args-and-kwargs-in-python/
Python: speed vs. memory tradeoff reading files | handyfloss
http://handyfloss.net/2008.02/python-speed-vs-memory-tradeoff-reading-files/
Reading Text Files
http://effbot.org/zone/readline-performance.htm

Uses of & in C++ - Johnson's research notes
https://sites.google.com/site/johnsonsnote/c-c-c-xue-xi-bi-ji/c-de-yong-fa

An introduction to Hyper Estraier - 陈叙远 - cnblogs
http://www.cnblogs.com/jjstar/archive/2006/12/08/586531.html
tokyo cabinet - Google Search
http://www.google.com.tw/search?q=tokyo+cabinet&hl=zh-TW&prmd=imvns&ei=TDSxTuHDMerxmAWo1_ySAg&start=10&sa=N&biw=1440&bih=809

Tokyo Cabinet: another DBM implementation | 互联网,请记住我
http://www.162cm.com/archives/681.html
The Tokyo Cabinet key-value database and its extended applications
http://webcache.googleusercontent.com/search?q=cache:5GpMohnGt-MJ:www.slideshare.net/rewinx/tokyo-cabinet-key-value+tokyo+cabinet&cd=14&hl=zh-TW&ct=clnk&gl=tw
A quick performance test of the open-source search engine Hyper Estraier and a summary of its shortcomings - 覃健祥 | chin at blogchina
http://chin.bokee.com/6784704.html

Efficient substring searching – Phusion Corporate Blog
http://blog.phusion.nl/2010/12/06/efficient-substring-searching/
Making Python grep
http://casa.colorado.edu/~ginsbura/pygrep.htm
Obtain substring using python. (Page 1) / Programming & Scripting / Arch Linux Forums
https://bbs.archlinux.org/viewtopic.php?id=126959
Find All Indices of a SubString in a Given String « Python recipes « ActiveState Code
http://code.activestate.com/recipes/499314-find-all-indices-of-a-substring-in-a-given-string/

Tuesday, November 01, 2011

Daily Bookmarks 20111101

MapReduce: is the free lunch not over yet? - 匪夷所思的人 - ITeye tech site
http://banditjava.iteye.com/blog/246160
Search engine terminology - 匪夷所思的人 - ITeye tech site
http://banditjava.iteye.com/blog/253184
How Lucene's inverted index works - 唯以不永伤
http://liubingnlp.appspot.com/?p=10003

Using SQLite from Python
http://www.comp.mq.edu.au/units/comp249/pythonbook/pythoncgi/pysqlite.html
Fetching Yahoo weather with Python | 代码回音
http://www.codecho.com/fetching-yahoo-weather-using-python/
Using SQLite in Python | 代码回音
http://www.codecho.com/using-sqlite-in-python/
SQLite Python tutorial (good site)
http://zetcode.com/db/sqlitepythontutorial/
Command Line Shell For SQLite
http://www.sqlite.org/sqlite.html
PHP 程式 學習 筆記本: [Quoted] A brief introduction to MongoDB
http://calos-tw.blogspot.com/2010/03/mongodb.html

Build MongoDB on FreeBSD -- for Jenkins use — Koansys
http://koansys.com/tech/build-mongodb-on-freebsd-for-jenkins-use
小默的研究中心 » Collective intelligence
http://wpxiaomo.sinaapp.com/?cat=29
python - pysqlite2: ProgrammingError - You must not use 8-bit bytestrings - Stack Overflow
http://webcache.googleusercontent.com/search?q=cache:FDT-qryM438J:stackoverflow.com/questions/2838100/pysqlite2-programmingerror-you-must-not-use-8-bit-bytestrings+&cd=1&hl=zh-TW&ct=clnk&client=firefox

Saturday, October 29, 2011

Daily Bookmarks 20111029

[Unix] Installing SQLite + Python + Apache + Django from tarballs @ FreeBSD 7.2 @ 第二十四個夏天後 :: 痞客邦 PIXNET ::
http://changyy.pixnet.net/blog/post/25225752-%5Bunix%5D-%E4%BD%BF%E7%94%A8-tarball-%E5%AE%89%E8%A3%9D-sqlite-%2B-python-%2B-apache-%2B-djan
Installation - prettytable - How to install PrettyTable - A simple Python library for easily displaying tabular data in a visually appealing ASCII table format - Google Project Hosting
http://code.google.com/p/prettytable/wiki/Installation
An introduction to Chinese word segmentation for search engines | 中文Flex例子
http://blog.minidx.com/2008/01/04/352.html
Mail filtering
http://people.ubuntu.com/~happyaron/ubuntu-docs-test/lucid/serverguide/zh_CN/mail-filtering.html
Minidx file management system | Minidx full-text search engine - Home
http://cn.minidx.com/
Python NLTK chinese - Google Search
http://www.google.com.tw/search?q=Python+NLTK+chinese&hl=zh-TW&prmd=imvns&ei=eoitTvTvC5KkiQfyivnGDw&start=10&sa=N&biw=1235&bih=663
Machine Learning for Email - O'Reilly Media
http://shop.oreilly.com/product/0636920022350.do
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html
Berkeley students develop image-based CAPTCHA Picatcha to end text CAPTCHAs - Yahoo!奇摩 3C Tech
http://tw.tech.yahoo.com/network_trend/article/id/20953/
Classification: Spam Filtering python - Google Search
http://www.google.com.tw/search?q=Classification:+Spam+Filtering+python&hl=zh-TW&prmd=imvns&source=lnt&tbs=lr:lang_1zh-CN%7Clang_1zh-TW&lr=lang_zh-CN%7Clang_zh-TW&sa=X&ei=LYatTsi1BamtiQfvo_jMDA&ved=0CAgQpwUoATgK&biw=1235&bih=663










Impressions · PKU - 人人小站
http://zhan.renren.com/jasony