Version 2 (modified by jazz, 12 years ago)

--

2012-10-25

Hadoop World 2012 (Keynotes)

  • 9:10am Hadoop: Thinking Big - John Schroeder (MapR Technologies)
    • MapR breaks Terasort benchmark record on Google Compute Engine
  • 9:20am Beyond Batch - Doug Cutting (Cloudera)
    • HBase: First Non-Batch Component
    • Google gives us the map - 2012 Spanner paper, 26 authors!
    • Cloudera Impala (2012) <- Google Dremel (2010): online queries!!
  • 9:30am Cloud, Mobile and Big Data – How Analytics Provides Value to the Buzzwords - Paul Kent (SAS)
    • Let businesses make decisions in a more timely way - Action in Time
    • Predicting future outcomes
  • 9:35am They Don't Teach You That In School - Cathy O'Neil, Julie Steele (O'Reilly Media, Inc.)
    • What a data scientist needs to know - machine learning, statistics
    • Feature selection - machine learning for ads
  • 9:45am From Traditional Database to Big Data Platform - Irfan Khan (SAP)
  • 9:50am Of Rocket Ships and Washing Machines: Data Technology for People - Joe Hellerstein (Trifacta and UC Berkeley)
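    • As with the invention of the dishwasher, we are still at a very early stage of data science, because 80% of data processing work goes into tidying the data - 80% of the work is in cleaning the data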
    • Develop productivity technology
    • Shreddr - http://www.captricity.com
    • Analytic Trifecta
  • 10:00am Are We Really Winning the Information Revolution? - Samantha Ravich (National Commission for the Review of R&D Programs in the Intelligence Community)
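    • Deep down we know the answer is somewhere in that pile of data, but we now have far too much data.
    • With so much data, we have to select and prioritize before we can really draw insight from it and make the right decisions.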

Hadoop World 2012 (Sessions)

  • 10:50am Performing Data Science with HBase - Aaron Kimball (WibiData)
    • Crunch: MapReduce pipelines for Java and Scala - Apache project
    • PCollections: Crunch data sets (P stands for Parallel)
  • 11:40am Upcoming Enterprise features in Apache HBase 0.96 - Jonathan Hsieh (Cloudera, Inc)
    • NOTE: Very nice slides for enterprises that plan to use HBase. They tell you what to prepare and what architecture is required.
    • Risk = downtime + data loss
    • Production systems need to avoid risk
      • Risks from within the cluster
        • Unplanned maintenance - hardware / software errors - detection time + recovery time
          • automated metadata repairs with hbck (0.92)
          • 0.92/0.94 - 180s to detect a RegionServer failure, 0.96 - 0~1s to detect a RegionServer failure
        • Planned maintenance - use NameNode HA + ZK to solve the problem
      • Risks from outside the cluster
        • Amazon power outages; "Backhoe: the true cyberthreat" (the backhoe is the real threat to the network!)
        • HBase supports batch backups - (1) Export / DistCp / Import (2) CopyTable (off-site backup)
        • HBase replication (0.92+) - (1) Master - Slave (0.90) (2) Master - Master (0.92)
      • Risks from users
        • User error - e.g. drop 'table'
        • Solution 1: user-level security (access control) - based on Kerberos
        • Solution 2: table snapshots (0.96+)
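  • 1:40pm Designing Scalable Network Architectures for Fast Moving Big Data - Kenneth Duda (Arista Networks), Amr Awadallah (Cloudera, Inc.)
    • How to design the network architecture of a large Hadoop cluster
    • (Think: the speaker mentioned how buffers affect Hadoop performance, which is relevant when tuning)
    • ZTP (Zero-Touch Provisioning) - used to manage switch configuration .... Hmmm... Cool~ (used at eBay)
    • MLAG for High Availability - HA for the network .... Cool feature~
    • Fast Server Failover - how to tell from the state of ICMP packets that a server has gone offline. Why wait for the OS to decide? Just let the switch tell the connection source directly~
    • EOS: common monitoring software can be installed on the switch (e.g. Ganglia, Nagios, fping, etc.)
    • (Think: this is something to think about at the very start of requirements planning. MapReduce and HDFS get considered -> how much compute and storage, but the network is often overlooked -> switch selection and monitoring support. It's all about SCALE!!)
    • QoS support - these problems only show up in very large environments
    • OpenFlow (SDN, Software-Defined Networking) and its impact on Hadoop environments - in the past, data locality / rack awareness had to be configured by hand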
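  • 2:30pm Is Your Cluster a Leaning Tower of Pisa? - Michael Segel (Think Big Analytics)
    • Joke: what second-year medical students mainly learn is how to ask patients questions!! Because a good diagnosis comes from good questions!!
    • (Think: the example questions here look a lot like the ones commonly seen on forum.hadoop.tw, where it takes two or three round trips to get to the actual problem. Sometimes it is not a cluster architecture problem, but people still tend to assume the environment is at fault)
    • (Think: CHUG's logo -> put Taiwan in it to design a Taiwan Hadoop User Group logo)
    • (Think: the workflow for enterprises adopting Hadoop. FAQs, vendor supply chain, ..)
    • Different types of cluster - from "on premises" to "CAAS (Cluster as a Service)"
      • CAAS - redundant data centers as an option (off-site backup, CDN)
    • DR (Disaster Recovery) / BCP (Business Continuity Planning)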
    • Golden Ratio -
      • CPU cores to Memory - 4~8 GB RAM per Core
      • 1+ Spindles (Hard Drives) per Core
      • With more than 4 drives, 1GbE is not enough (network)
      • According to Moore's law -> the optimal ratio will need to be re-evaluated.
    • Think about TCO (Total Cost of Ownership)!!
    • Using VMs:
      • PRO: allows multi-tenancy
    • In the future - we expect to see more virtualization
      • Mesos / Spark - Berkeley
      • YARN
      • Storm
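    • Use VMs to keep the ratio 'balanced'!!
  • 4:10pm Real-time Big Data Without Streaming - Ron Bodkin (Think Big Analytics)
    • A fairly high-level architecture topic: which architecture to adopt for applications with different real-time requirements.
    • My feeling is that the components are basically the same few (NoSQL, index, search, streaming server), and the harder part comes afterwards: how to connect these components together (e.g. connectors).
  • 5:00pm Realtime Processing with Storm - Gabriel Eisbruch (MercadoLibre.com), Luis Darío Simonassi (MercadoLibre.com), Jonathan Leibiusky (MercadoLibre.com)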
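
The "Golden Ratio" rules of thumb noted in the Michael Segel session (4~8 GB RAM per core, 1+ spindle per core, 1GbE not enough beyond 4 drives) can be turned into a quick sanity check for a node spec. The following is only a sketch: the `NodeSpec` class, the `check_balance` function and the two sample nodes are hypothetical names and values made up for illustration, not something shown in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSpec:
    """Hardware of one worker node (all sample values below are hypothetical)."""
    cores: int
    ram_gb: int
    drives: int
    network_gbps: float


def check_balance(node: NodeSpec) -> List[str]:
    """Return a warning for each rule of thumb from the notes that the node breaks."""
    warnings = []

    # Rule 1: 4~8 GB RAM per core.
    ram_per_core = node.ram_gb / node.cores
    if not 4 <= ram_per_core <= 8:
        warnings.append(
            f"RAM per core is {ram_per_core:.1f} GB; the notes suggest 4~8 GB"
        )

    # Rule 2: at least one spindle (hard drive) per core.
    if node.drives < node.cores:
        warnings.append(
            f"{node.drives} drives for {node.cores} cores; "
            "the notes suggest 1+ spindle per core"
        )

    # Rule 3: with more than 4 drives, 1GbE is not enough.
    if node.drives > 4 and node.network_gbps <= 1:
        warnings.append("more than 4 drives on 1GbE; the notes say this is not enough")

    return warnings


# Two made-up node specs: one that follows the ratios and one that does not.
balanced = NodeSpec(cores=8, ram_gb=48, drives=8, network_gbps=10)
skewed = NodeSpec(cores=16, ram_gb=32, drives=6, network_gbps=1)

print("balanced:", check_balance(balanced))
for warning in check_balance(skewed):
    print("skewed:", warning)
```

As the notes point out, these ratios are expected to shift as hardware evolves, so the thresholds are better treated as parameters to revisit than as fixed constants.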