| 1 | = 2012-10-25 = |
| 2 | |
| 3 | == Hadoop World 2012 (Keynotes) == |
| 4 | |
| 5 | * 9:10am '''Hadoop: Thinking Big''' - John Schroeder (MapR Technologies) |
| 6 | * MapR breaks Terasort benchmark record on Google Compute Engine |
| 7 | * 9:20am Plenary |
| 8 | * '''Beyond Batch''' - Doug Cutting (Cloudera) |
| 9 | * HBase: First Non-Batch Component |
| 10 | * Google gave us a map - 2012 Spanner paper, 26 authors! |
| 11 | * Cloudera Impala (2012), inspired by Google Dremel (2010): online queries! |
| 12 | * 9:30am Plenary |
| 13 | * '''Cloud, Mobile and Big Data – How Analytics Provides Value to the Buzzwords''' - Paul Kent (SAS) |
| 14 | * Lets enterprises make decisions in a more timely way - Action in Time |
| 15 | * Predicting future outcomes |
| 16 | * 9:35am Plenary |
| 17 | * '''They Don't Teach You That In School''' - Cathy O'Neil, Julie Steele (O'Reilly Media, Inc.) |
| 18 | * What a data scientist needs to know - machine learning, statistics |
| 19 | * Feature selection - machine learning for advertising |
| 20 | * 9:45am Plenary |
| 21 | * '''From Traditional Database to Big Data Platform''' - Irfan Khan (SAP) |
| 22 | * 9:50am Plenary |
| 23 | * '''Of Rocket Ships and Washing Machines: Data Technology for People''' - Joe Hellerstein (Trifacta and UC Berkeley) |
| 24 | * As with the invention of the dishwasher, data science is still at a very early stage of development, because 80% of the work is in cleaning the data |
| 25 | * Develop productivity technology |
| 26 | * Shreddr - http://www.captricity.com |
| 27 | * Analytic Trifecta |
| 28 | * 10:00am Plenary |
| 29 | * '''Are We Really Winning the Information Revolution?''' - Samantha Ravich (National Commission for the Review of R&D Programs in the Intelligence Community) |
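| 30 | * Deep down we know the answer is somewhere in that pile of data, but now we have far too much data. |
| 31 | * With this much data, we have to select and prioritize before we can really gain insight from it and make the right decisions. |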
| 32 | |
| 33 | == Hadoop World 2012 (Sessions) == |
| 34 | |
| 35 | * 10:50am '''Performing Data Science with HBase''' - Aaron Kimball (WibiData) |
| 36 | * Crunch: MapReduce pipelines for Python and Scala - Apache project |
| 37 | * PCollections: Crunch data sets (P stands for Parallel) |
| 38 | * 11:40am '''Upcoming Enterprise features in Apache HBase 0.96''' - Jonathan Hsieh (Cloudera, Inc) |
| 39 | * '''NOTE: Very nice slides for enterprises that plan to use HBase. They explain what you should prepare and the required architecture.''' |
| 40 | * Risk = downtime + data loss |
| 41 | * Production systems need to avoid risk |
| 42 | * Risks from within the cluster |
| 43 | * Unplanned maintenance - hardware / software errors - detection time + recovery time |
| 44 | * Automated metadata repairs with `hbck` (0.92) |
| 45 | * 0.92/0.94: 180 s to detect a RegionServer failure; 0.96: 0~1 s to detect a RegionServer failure |
| 46 | * Planned maintenance - use NameNode HA + ZooKeeper (ZK) to solve the problem |
| 47 | * Risks from outside the cluster |
| 48 | * Amazon power outage problem; Backhoe: the true cyberthreat! |
| 49 | * HBase supports batch backups - (1) Export / DistCp / Import (2) CopyTable (off-site backup) |
| 50 | * HBase replication (0.92+) - (1) Master - Slave (0.90) (2) Master - Master (0.92) |
| 51 | * Risks from users |
| 52 | * User error - e.g. drop 'table' |
| 53 | * Solution 1: user-level security (access control) - based on Kerberos |
| 54 | * Solution 2: table snapshots (0.96+) |
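| 55 | * 1:40pm '''Designing Scalable Network Architectures for Fast Moving Big Data''' - Kenneth Duda (Arista Networks), Amr Awadallah (Cloudera, Inc.) |
| 56 | * How to design the network architecture of a large Hadoop cluster |
| 57 | * (Think: the speaker mentioned how buffers affect Hadoop performance, so when tuning ...) |
| 58 | * ZTP (Zero-Touch Provisioning) - used to manage switch configuration ... Hmmm... Cool~ (used at eBay) |
| 59 | * MLAG for High Availability - HA for the network ... cool feature~ |
| 60 | * Fast Server Failover - how to tell from the state of ICMP packets that a server has gone offline. Why wait for the OS to decide? Just let the switch tell the connection source~ |
| 61 | * EOS: common monitoring software (e.g. Ganglia, Nagios, fping) can be installed on the switch |
| 62 | * (Think: this is a question to consider at the very start of requirements planning. MapReduce and HDFS get considered -> how much compute and storage, but the network is often overlooked -> switch selection and monitoring support. It's all about SCALE!!) |
| 63 | * QoS support - these problems only occur in very large environments |
| 64 | * [http://www.openflow.org OpenFlow] (SDN, Software-Defined Networking) and its impact on Hadoop environments - data locality / rack awareness used to require manual configuration |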
| 65 | * 2:30pm '''Is Your Cluster a Leaning Tower of Pisa?''' - Michael Segel (Think Big Analytics) |
| 66 | * |
| 67 | * 4:10pm '''Real-time Big Data Without Streaming''' - Ron Bodkin (Think Big Analytics) |
| 68 | * 5:00pm '''Realtime Processing with Storm''' - Gabriel Eisbruch (mercadolibre.com), Luis Darío Simonassi (mercadolibre.com), Jonathan Leibiusky (mercadolibre.com) |
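
The HBase 0.96 session notes above break unplanned downtime into detection time + recovery time, and give 180 s (0.92/0.94) versus 0~1 s (0.96) for detecting a RegionServer failure. A minimal Python sketch of that arithmetic follows. The detection figures are the ones from the notes; `RECOVERY_SECONDS` and the function name `unplanned_downtime` are assumptions made up for illustration, not numbers or APIs from the talk or from HBase.

```python
# Sketch of the "detection time + recovery time" view of unplanned downtime.
# Detection times are taken from the session notes; RECOVERY_SECONDS is a
# hypothetical placeholder, not a measured HBase figure.

DETECTION_SECONDS = {
    "0.92/0.94": 180.0,  # from the notes
    "0.96": 1.0,         # upper end of the 0~1 s range in the notes
}

RECOVERY_SECONDS = 60.0  # hypothetical value, for illustration only


def unplanned_downtime(detection_seconds: float, recovery_seconds: float) -> float:
    """Downtime of one failure = time to notice it + time to recover from it."""
    if detection_seconds < 0 or recovery_seconds < 0:
        raise ValueError("durations must be non-negative")
    return detection_seconds + recovery_seconds


if __name__ == "__main__":
    for version, detect in DETECTION_SECONDS.items():
        total = unplanned_downtime(detect, RECOVERY_SECONDS)
        print(f"HBase {version}: {detect:.0f} s detect + "
              f"{RECOVERY_SECONDS:.0f} s recover = {total:.0f} s")
```

Under this assumed recovery time, detection dominates the total for 0.92/0.94 and becomes almost negligible for 0.96, which is why the faster failure detection matters for production systems.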