Wednesday, November 30, 2011

Daily Bookmarks 20111130

Using the uniq command to compute the intersection, union, and difference of multiple text files_骚人默客_Sina Blog
http://blog.sina.com.cn/s/blog_5133d4dd0100lemw.html

Is there a Python package that can measure text similarity?
http://www.douban.com/group/topic/5712159/

Indexing and search
ir.hit.edu.cn/phpwebsite/index.php?module...

Larbin [1]: hashtable checker source code analysis - quweiprotoss's journal - NetEase Blog (very good site)
http://quweiprotoss.blog.163.com/blog/static/4088288320103190243558/
From Larbin to massive data processing_sunshinesandy_Baidu Space
http://hi.baidu.com/sunshinesandy/blog/item/4aab0e0e0dc43e2ce82488c7.html
Massive data » 码农 | Following the Internet, algorithms, and development
http://blog.redfox66.com/post/category/search-tech/massdata
Web crawler--larbin - to myself's categorized study log - C++ Blog
http://www.cppblog.com/toMyself/archive/2010/08/28/125073.aspx


7H2O | 汽水森林
http://www.7h2o.com/category/python/

smallseg - DFA Based Chinese Word Segmentation Library of Python and Java - Google Project Hosting
http://code.google.com/p/smallseg/

A simple implementation of Chinese word segmentation in Python - FreeDoDo.com
http://www.freedodo.com/2011/03/28/%E7%94%A8python%E7%AE%80%E5%8D%95%E5%AE%9E%E7%8E%B0%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D.html

Automatic tag extraction with TF-IDF - 快乐学习 - 不烦恼's blog
http://bufannao.com/archives/TF-IDF.html

Web content recommendation system | 信研所 :: a sharing community for management information systems and related fields
http://www.misins.org/wcrs

TF-IDF algorithm experiment - MrYang's Blog - BlogBus
http://mryang.blogbus.com/logs/45675845.html
A casual look at text analysis: keyword extraction in classifiers « UGC广播站
http://ugc.renren.com/2010/02/01/keywords-extraction-overview/

Automatic Keyword Extraction - Homepage of Cheng-Zhi Zhang
https://sites.google.com/site/zhangczhomepage/keyword-extraction




Sunday, November 27, 2011

Daily Bookmarks 20111127

A fast method for URL deduplication
http://www.360doc.com/content/08/1031/15/3500_1855560.shtml
~/.trash » Bloom filter notes (1)
http://grepk.com/?p=605
BloomFilter: a powerful tool for large-scale data processing (solving the empty-lookup problem) | dbafree home
http://www.dbafree.net/?p=36
Studying the Bloom Filter algorithm commonly used for URL deduplication in web spiders… | 互联网,请记住我
http://www.162cm.com/archives/783.html
Web crawler design: designing a URL deduplication store_守护地下铁_Baidu Space
http://hi.baidu.com/shirdrn/blog/item/40ed0fb1ceac4d5c0923029d.html

URL deduplication is not simple - 智障大师's column - Blog Channel - CSDN.NET
http://blog.csdn.net/historyasamirror/article/details/6746217
Notes on NoSQL databases
http://sebug.net/paper/databases/nosql/Nosql.html#_8314717379700977_930601348298
Blog of the Oracle Berkeley DB China R&D team » embedded
http://www.bdbchina.com/tag/embedded/
The problem of deduplicating large numbers of URLs - jollyjumper's column - Blog Channel - CSDN.NET
http://blog.csdn.net/jollyjumper/article/details/6415723
程式扎記: [ Java Crawler ] Designing the crawler queue
http://puremonkey2010.blogspot.com/2011/10/java-crawler_20.html
Bloom Filter « Python recipes « ActiveState Code (22 lines, good site)
http://code.activestate.com/recipes/577684-bloom-filter/
Bloom filters and a simple spell checker in Python
http://lists.canonical.org/pipermail/kragen-hacks/2006-August/000431.html

Analysis of duplicate web page detection techniques in search engines (continued) - 我的BT下载实验室 - ITeye tech site (nice site title)
http://wangdei.iteye.com/blog/376721

Coding Horror: URL Shortening: Hashes In Practice
http://www.codinghorror.com/blog/2007/08/url-shortening-hashes-in-practice.html

Crawling the Web
dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf

Applying the Bloom filter algorithm to URL deduplication_菩提树_Sina Blog (very nice site)
http://blog.sina.com.cn/s/blog_7165756b0100odeu.html
Applying K-value clustering within the same price range_菩提树_Sina Blog
http://blog.sina.com.cn/s/blog_7165756b0100odij.html
Applying Bloom filters to web page deduplication-泪下的天空-My Sohu (nice site; C++ implementation, worth a look)
http://jinyun2012.blog.sohu.com/163477317.html
URL deduplication: the Bloom Filter algorithm, its error rate, and more - 我要去桂林: 田春峰's IT weblog - IT改进生活
http://blog.donews.com/accesine/archive/2007/01/23/1118640.aspx
URL deduplication in larbin: the Bloom Filter algorithm - piziwang - ITeye tech site
http://piziwang.iteye.com/blog/740394
Larbin : Parcourir le web, telle est ma passion
http://larbin.sourceforge.net/index-eng.html

THE VERY SIMPLE HASH TABLE EXAMPLE (Java, C++) | Algorithms and Data Structures
http://www.algolist.net/Data_structures/Hash_table/Simple_example
Python dictionary implementation | Laurent Luce's Blog
http://www.laurentluce.com/posts/python-dictionary-implementation/

Notes from reading the nutch source code - CookStar - cnblogs
http://www.cnblogs.com/clarkchen/archive/2011/02/22/1960892.html





Thursday, November 24, 2011

Daily Bookmarks 20111124

Ten massive-data processing interview questions and a summary of ten methods - 结构之法 算法之道 - Blog Channel - CSDN.NET
http://blog.csdn.net/v_JULY_v/article/details/6279498
Jonathan Ellis's Programming Blog - Spyced: All you ever wanted to know about writing bloom filters
http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
11. A thorough, start-to-finish analysis of hash table algorithms - 结构之法 算法之道 - Blog Channel - CSDN.NET
http://blog.csdn.net/v_JULY_v/article/details/6256463



Wednesday, November 23, 2011

Daily Bookmarks 20111123

Bloom Filter « dev.poga.tw
http://devpoga.wordpress.com/2010/01/07/bloom-filter/
Bloom filter - using hashes to test whether an element exists in a set @ 第二十四個夏天後 :: PIXNET ::
http://changyy.pixnet.net/blog/post/22946355-bloom-filter---%E4%BD%BF%E7%94%A8-hash-%E4%BE%86%E5%88%A4%E6%96%B7%E5%85%83%E7%B4%A0%E6%98%AF%E5%90%A6%E5%AD%98%E5%9C%A8%E6%96%BC%E4%B8%80
Sigma » Hash and Bloom Filter
http://www.sigma.me/2011/09/13/hash-and-bloom-filter.html
Massive data processing: Bloom Filter explained in detail - 结构之法 算法之道 - Blog Channel - CSDN.NET
http://blog.csdn.net/v_july_v/article/details/6685894
Implementing a URL filter based on the Bloom Filter algorithm - 黑色的菠菜 - ITeye tech site
http://lywybo.iteye.com/blog/787337
Nothing: Bloom Filter (nice site)
http://wnicholas.blogspot.com/2010/04/bloom-filter.html
Bloom Filter (Python version)_学思之家_Baidu Space (good site)
http://hi.baidu.com/ruclin/blog/item/6eceed50e2124715367abe94.html/cmtid/7c5edad70066212006088b64
Hash lookups, Database lookups, and Scalability
http://www.perlmonks.org/?node_id=404105

Bloom Filter Resources | BitWorking | Joe Gregorio
http://bitworking.org/news/380/bloom-filter-resources
Title: A summary of methods for processing large data volumes and massive data
http://bbs.xjtu.edu.cn/BMYAJBDVQSTVHSJUADPOGJEVMYLABIFCXFQP_B/con?B=Algorithm&F=M.1259224358.A&N=3682&T=0
jython - Modern, high performance bloom filter in Python? - Stack Overflow
http://stackoverflow.com/questions/311202/modern-high-performance-bloom-filter-in-python
URL hashing in search engines - Linux环境高级编程 - Blog Channel - CSDN.NET
http://blog.csdn.net/21aspnet/article/details/6596746
Web crawler design: designing a URL deduplication store_守护地下铁_Baidu Space
http://hi.baidu.com/shirdrn/blog/item/40ed0fb1ceac4d5c0923029d.html
URL hash architecture for image servers - Linux环境高级编程 - Blog Channel - CSDN.NET
http://blog.csdn.net/21aspnet/article/details/6596745
Saving memory: Instagram's Redis practices - NoSQLFan - following NoSQL technology and news
http://blog.nosqlfan.com/html/3379.html

Tuesday, November 22, 2011

Sunday, November 20, 2011

Daily Bookmarks 20111120

Exporting Magento product data for submission to Google Products: exporting XML with code | mivec
http://mivec.ixsbn.com/archives/magento-export-product-and-submit-to-google-product-part-ii
one tip for title optimisation|Data Feed|每天进步一点点
http://holabell.com/googleseo/data-feed/one-tip-for-title-optimisation
Google Product Search results now show YouTube review videos | 谷奥: exploring the mysteries of Google
http://www.guao.hk/posts/youtube-video-reviews-now-part-of-google-product-search.html
Google seeks to improve Product Search quality through new merchant rules | 谷飯
http://goofan.net/6578/google-sought-new-business-rules-to-improve-search-quality-product-search.html
Is Google learning from Baidu and pushing its shopping search?--KPN keyword SEO service
http://kpnweb.com/news_content.asp?n_id=114
Google Product Search gets a full upgrade for the Christmas holiday shopping rush | 谷奥: exploring the mysteries of Google
http://www.guao.hk/posts/google-product-search-updates-for-holiday-seasons.html#more-13767
Product feed specification - Google Merchant Center Help
http://www.google.com/support/merchants/bin/answer.py?hl=zh-Hans&answer=188494#other
Baidu product search FAQ - Web Search Help
http://open.baidu.com/coop/productQuestion.html#01
Baidu Open Platform_Open category details
http://open.baidu.com/cms/static/category.html?name=goods#category_guide_8
How to generate an XML file for the Baidu Open Platform_Baidu Jingyan
http://jingyan.baidu.com/article/6079ad0e49c6d228ff86dbd5.html

http://www.google.com/basepages/producttype/taxonomy.en-US.txt
PChome EC search engine: find products across sites!
http://briian.com/?p=6119
Fun3C - the most convenient and accurate 3C price comparison site
http://fun3c.lingtelli.com/
《數位之牆》 (Digital Wall): Openfind is first to launch a product search service in Taiwan
http://www.digitalwall.com/scripts/displaypr.asp?UID=1641

Wednesday, November 16, 2011

Monday, November 07, 2011

Daily Bookmarks 20111107

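A dictionary-based forward maximum matching algorithm for Chinese word segmentation that handles mixed Chinese, English, and numeric text - lucene + hadoop distributed parallel computing search framework - BlogJava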
http://www.blogjava.net/nianzai/archive/2011/08/04/355786.html
Chinese word segmentation for a full-text content recommendation engine - 51CTO.COM
http://database.51cto.com/art/201108/284085.htm
台扣啵's research log: [PHP] Implementing Chinese word segmentation for Zend_Search_Lucene - Roodo Blog
http://blog.roodo.com/taikobo0/archives/6027073.html
Introduction to Chinese word segmentation: the maximum matching method | 我爱自然语言处理
http://www.52nlp.cn/maximum-matching-method-of-chinese-word-segmentation
MIT Natural Language Processing, lecture 2: word counting (part 4) | 我爱自然语言处理
http://www.52nlp.cn/mit-nlp-second-lesson-word-counting-fourth-part
Search & advertising « 自娱自乐
http://xjzhou.wordpress.com/category/machinelearning/%E6%90%9C%E7%B4%A2%E5%B9%BF%E5%91%8A/
Google 黑板报 - the Google China blog, a closer look at our products, technology, and culture: The Beauty of Mathematics, part 2 -- on Chinese word segmentation
http://www.google.com.hk/ggblog/googlechinablog/2006/04/blog-post_2507.html
The impressive Chinese word segmentation in Google Chrome - Miscellaneous - python.cn(news, jobs)
http://simple-is-better.com/news/319
Code sharing: computing text similarity with the layering (层叠) method (algorithms/data structures) -by TY -pythoner.net
http://pythoner.net/code/31/

A simple script for converting multi-line srt subtitles to single lines [Python] | Felix's Blog
http://blog.felixc.at/2010/07/srt-multiline-convert-python/
Python: converting plain text to PNG | Felix's Blog
http://blog.felixc.at/2011/05/python-text-to-png/
Python: usage frequency of the various emoticons on Sina Weibo - L Cooper - cnblogs
http://www.cnblogs.com/Lannik/archive/2011/10/21/2219776.html

Sunday, November 06, 2011

Daily Bookmarks 20111105

Implementing Ajax pagination with JSON-formatted data « 老韩
http://www.handaoliang.com/article_94.html
A practical search and pagination effect built with jQuery, JSON, and Ajax (code)_HTML5, JS code, Div+CSS, CSS3, cool site showcase - 网页前端吧 - work happily, live joyfully
http://www.jscss8.com/jsad/qitadaima/20100729_6258.html
How to use jQuery to paginate JSON data? - Stack Overflow
http://stackoverflow.com/questions/2507844/how-to-use-jquery-to-paginate-json-data
The jQuery Pagination Ajax pagination plugin explained in detail (in Chinese) « 张鑫旭-鑫空间-鑫生活
http://www.zhangxinxu.com/wordpress/2010/01/jquery-pagination-ajax%E5%88%86%E9%A1%B5%E6%8F%92%E4%BB%B6%E4%B8%AD%E6%96%87%E8%AF%A6%E8%A7%A3/
Pagination | jQuery Plugins
http://plugins.jquery.com/project/pagination
jQuery plugin: Tablesorter 2.0 - Pager plugin
http://tablesorter.com/docs/example-pager.html
Making a jQuery pagination system | web enavu
http://web.enavu.com/tutorials/making-a-jquery-pagination-system/
Lighty RoR: Pagination: making pagination no longer tedious
http://lightyror.thegiive.net/2006/11/pagination.html
How to Paginate Data with PHP | Nettuts+
http://net.tutsplus.com/tutorials/php/how-to-paginate-data-with-php/



A brief analysis of P->NP->NP-complete-NP-hard problems - 张林林|深蓝 (Linlin Zhang, shenlan211314)'s column - Blog Channel - CSDN.NET
http://blog.csdn.net/shenlan211314/article/details/6232472
~Paganini Amadeus' Notebook~: P vs NP vs NP-Hard vs NP-Complete - yam天空部落
http://blog.yam.com/hn12303158/article/19947114


Infinite Loop: [Notes] Building a Chrome Extension for downloading YouTube videos
http://program-lover.blogspot.com/2011/08/youtube-chrome-extension.html



Thursday, November 03, 2011

Daily Bookmarks 20111103

The stringlib Library
http://effbot.org/zone/stringlib.htm
python - improving Boyer-Moore string search - Stack Overflow
http://stackoverflow.com/questions/1106112/improving-boyer-moore-string-search
elastic search, another good Lucene-based NoSQL project | summersmile1984's personal site
http://summersmile1984.i-branding.me/2011/03/31/elastic-search%E5%8F%88%E4%B8%80%E4%B8%AA%E5%9F%BA%E4%BA%8Elucene%E7%9A%84nosql%E5%A5%BD%E9%A1%B9%E7%9B%AE/
[projects] Contents of /python/trunk/Objects/stringlib/fastsearch.h
http://svn.python.org/view/python/trunk/Objects/stringlib/fastsearch.h?revision=77470&view=markup
Lucid Imagination » Exploring Lucene’s Indexing Code: Part 2
http://www.lucidimagination.com/blog/2009/03/18/exploring-lucenes-indexing-code-part-2/
Delve inside the Lucene indexing mechanism
http://www.ibm.com/developerworks/library/wa-lucene/

How to Index PDF Documents with Lucene | kalani's Tech blog
http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html
elasticsearch - - Open Source, Distributed, RESTful, Search Engine
http://www.elasticsearch.org/
Study notes 4.3 - Document filtering: Use Naive Bayes_土老冒_Baidu Space
http://hi.baidu.com/idontknow1987/blog/item/f36adcc5e5e87da48326ac4b.html
A brief guide to installing and using PyLucene | 非鱼观点 - Internet observations
http://www.unfish.net/archives/269-20080118.html
绚丽也尘埃 » PyLucene in Action
http://www.fuzhijie.me/?p=273
SourceForge.net: Benchmarks - clucene
http://sourceforge.net/apps/mediawiki/clucene/index.php?title=Benchmarks
Django and Lupy
http://www.rkblog.rk.edu.pl/w/p/django-lupy/

Xapian performance comparision with Whoosh « Searching with Xapian
http://xapian.wordpress.com/2009/02/12/xapian-performance-comparision-with-whoosh/

xapwrap - a PHP wrapper for calling xapian, with Chinese search support - Google Project Hosting
http://code.google.com/p/xapwrap/


Building an index with xapian (Python version) - System Architecture - python.cn(news, jobs)
http://simple-is-better.com/news/619


Stemming Algorithm - 荡气回肠,奔流不息 - tayoto - Hexun Blog
http://tayoto.blog.hexun.com/38957815_d.html


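Online demo | Chinese word segmentation | Chinese word segmentation for PHP - a free, open-source, simple Chinese word segmentation system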
http://www.ftphp.com/scws/demo.php

About xunsearch - 迅搜 (xunsearch) - a free, open-source Chinese full-text search engine
http://www.xunsearch.com/about

纵横搜索
http://discuz.qq.com/service/search

Chinese word segmentation « 神仙的仙居
http://xiezhenye.com/tag/%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D

Chinese word segmentation in Python: a pure-Python implementation / FMM algorithm / pymmseg-cpp / smallseg / judou 句读 / BECer-GAE - Miscellaneous - python.cn(news, jobs)
http://simple-is-better.com/news/387






Wednesday, November 02, 2011

Daily Bookmarks 20111102

A deep dive into the Lucene indexing mechanism
http://www.ibm.com/developerworks/cn/java/wa-lucene/
Lucene in practice, Part 1: Getting to know Lucene
http://www.ibm.com/developerworks/cn/java/j-lo-lucene1/index.html

python
Understanding decorators in Python | 代码回音
http://www.codecho.com/understanding-python-decorators/#more-131893
Binary search algorithm in Python (bisection) | 代码回音
http://www.codecho.com/python-binary-search-algorithm/
How to use *args and **kwargs in Python | 代码回音
http://www.codecho.com/how-to-use-args-and-kwargs-in-python/
Python: speed vs. memory tradeoff reading files | handyfloss
http://handyfloss.net/2008.02/python-speed-vs-memory-tradeoff-reading-files/
Reading Text Files
http://effbot.org/zone/readline-performance.htm

Usage of & in C++ - Johnson's research notes
https://sites.google.com/site/johnsonsnote/c-c-c-xue-xi-bi-ji/c-de-yong-fa

An introduction to Hyper Estraier - 陈叙远 - cnblogs
http://www.cnblogs.com/jjstar/archive/2006/12/08/586531.html
tokyo cabinet - Google Search
http://www.google.com.tw/search?q=tokyo+cabinet&hl=zh-TW&prmd=imvns&ei=TDSxTuHDMerxmAWo1_ySAg&start=10&sa=N&biw=1440&bih=809

Tokyo Cabinet: another DBM implementation | 互联网,请记住我
http://www.162cm.com/archives/681.html
The Tokyo Cabinet key-value database and its extended applications
http://webcache.googleusercontent.com/search?q=cache:5GpMohnGt-MJ:www.slideshare.net/rewinx/tokyo-cabinet-key-value+tokyo+cabinet&cd=14&hl=zh-TW&ct=clnk&gl=tw
A quick performance test of the open-source search engine Hyper Estraier and a summary of its shortcomings--覃健祥 | chin at blogchina
http://chin.bokee.com/6784704.html

Efficient substring searching – Phusion Corporate Blog
http://blog.phusion.nl/2010/12/06/efficient-substring-searching/
Making Python grep
http://casa.colorado.edu/~ginsbura/pygrep.htm
Obtain substring using python. (Page 1) / Programming & Scripting / Arch Linux Forums
https://bbs.archlinux.org/viewtopic.php?id=126959
Find All Indices of a SubString in a Given String « Python recipes « ActiveState Code
http://code.activestate.com/recipes/499314-find-all-indices-of-a-substring-in-a-given-string/

Tuesday, November 01, 2011

Daily Bookmarks 20111101

MapReduce: is the free lunch not over yet? - 匪夷所思的人 - ITeye tech site
http://banditjava.iteye.com/blog/246160
Search engine terminology - 匪夷所思的人 - ITeye tech site
http://banditjava.iteye.com/blog/253184
Principles of the Lucene inverted index - 唯以不永伤
http://liubingnlp.appspot.com/?p=10003

Using SQLite from Python
http://www.comp.mq.edu.au/units/comp249/pythonbook/pythoncgi/pysqlite.html
Fetching Yahoo weather with Python | 代码回音
http://www.codecho.com/fetching-yahoo-weather-using-python/
Using SQLite in Python | 代码回音
http://www.codecho.com/using-sqlite-in-python/
SQLite Python tutorial (good site)
http://zetcode.com/db/sqlitepythontutorial/
Command Line Shell For SQLite
http://www.sqlite.org/sqlite.html
PHP 程式 學習 筆記本: [Quoted] A brief introduction to MongoDB
http://calos-tw.blogspot.com/2010/03/mongodb.html

Build MongoDB on FreeBSD -- for Jenkins use — Koansys
http://koansys.com/tech/build-mongodb-on-freebsd-for-jenkins-use
小默's research center » Collective intelligence
http://wpxiaomo.sinaapp.com/?cat=29
python - pysqlite2: ProgrammingError - You must not use 8-bit bytestrings - Stack Overflow
http://webcache.googleusercontent.com/search?q=cache:FDT-qryM438J:stackoverflow.com/questions/2838100/pysqlite2-programmingerror-you-must-not-use-8-bit-bytestrings+&cd=1&hl=zh-TW&ct=clnk&client=firefox