= 2010-11-24 =

== Hadoop & GPFS ==

 * [http://www.enterprisestorageforum.com/technology/news/article.php/3913996/IBM-Builds-on-Hadoop-with-New-Storage-Architecture.htm IBM Builds on Hadoop with New Storage Architecture] - IBM has made GPFS able to work with HDFS to provide more highly available storage. The result is called "General Parallel File System-Shared Nothing Cluster (GPFS-SNC)".
 * [http://www.zdnet.com.tw/news/hardware/0,2000085676,20148269,00.htm IBM announces new storage architecture GPFS-SNC] (ZDNet Taiwan, in Chinese)

== Public Large Data Sets ==

 * Wikipedia - http://en.wikipedia.org/wiki/Wikipedia:Database_download
 * Public data sets hosted by Amazon - http://aws.amazon.com/publicdatasets/
   * Includes genomic data (e.g. the 1000 Genomes Project)
 * http://www.statmt.org/europarl/
 * http://www.opendatacenteralliance.org/
 * [http://www.data.gov/raw/92 Data.gov] - US public-sector data. For Taiwan, the [http://www.archives.gov.tw/ National Archives Administration] is probably the place to look ([wiki:jazz/10-11-08 2010-11-08]).
 * http://ckan.org/
 * http://stat-computing.org/dataexpo/2009/ - Flight records: airline on-time performance data for 1987-2008. Nearly 120 million records, about 12 GB uncompressed.
 * http://kdd.ics.uci.edu/ - UCI Knowledge Discovery in Databases Archive. The data sets are neatly categorized and are useful for many machine-learning exercises with Hadoop/Mahout.