Version 1 (modified by jazz, 13 years ago)

2012-10-25

Hadoop World 2012 (Keynotes)

9:10am Hadoop: Thinking Big - John Schroeder (MapR Technologies)
- MapR breaks Terasort benchmark record on Google Compute Engine
9:20am Plenary
Beyond Batch - Doug Cutting (Cloudera)
- HBase: First Non-Batch Component
- Google gives us the map - 2012 Spanner paper, 26 authors!
- Cloudera Impala (2012), modeled on Google Dremel (2010): online queries!!
9:30am Plenary
Cloud, Mobile and Big Data – How Analytics Provides Value to the Buzzwords - Paul Kent (SAS)
- Let enterprises make decisions in a more timely way - Action in Time
- Predicting Future outcomes
9:35am Plenary
They Don't Teach You That In School - Cathy O'Neil, Julie Steele (O'Reilly Media, Inc.)
- What a data scientist is required to know - machine learning, statistics
- Feature selection - machine learning for ads
9:45am Plenary
From Traditional Database to Big Data Platform - Irfan Khan (SAP)
9:50am Plenary
Of Rocket Ships and Washing Machines: Data Technology for People - Joe Hellerstein (Trifacta and UC Berkeley)
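- As with the invention of the dishwasher, data science is still at a very early stage, because 80% of data processing work goes into tidying the data - 80% of the work is in cleaning the data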
- Develop productivity technology
- Shreddr - http://www.captricity.com
- Analytic Trifecta
10:00am Plenary
Are We Really Winning the Information Revolution? - Samantha Ravich (National Commission for the Review of R&D Programs in the Intelligence Community)
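- Deep down we know the answer is somewhere in that pile of data, but we now have far too much data.
- With so much data, we have to select and prioritize before we can really gain insight from it and make the right decisions.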

Hadoop World 2012 (Sessions)

10:50am Performing Data Science with HBase - Aaron Kimball (WibiData)
- Crunch: MapReduce pipelines for Python and Scala - Apache project
- PCollections: Crunch data sets (P stands for Parallel)
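
To make the "parallel collection" idea concrete, here is a minimal sketch of a pipeline built from a collection whose elements are transformed in parallel and then counted. `ToyPCollection`, `parallel_do` and `count` are hypothetical names made up for this illustration; this is not the Crunch API, and a thread pool stands in for a MapReduce job.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class ToyPCollection:
    """Hypothetical stand-in for a parallel collection; NOT the Crunch API."""

    def __init__(self, items):
        self.items = list(items)

    def parallel_do(self, fn, workers=4):
        # Apply fn to every element concurrently. fn returns a list of
        # outputs (zero or more per input); results are flattened in
        # input order because pool.map preserves ordering.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            chunks = list(pool.map(fn, self.items))
        return ToyPCollection(out for chunk in chunks for out in chunk)

    def count(self):
        # Count occurrences of each distinct element.
        return dict(Counter(self.items))


lines = ToyPCollection(["big data", "big cluster"])
words = lines.parallel_do(lambda line: line.split())
word_counts = words.count()
print(word_counts)
```

The point of the shape is that each step returns a new collection, so a pipeline reads as a chain of transformations rather than as hand-written map and reduce classes.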
11:40am Upcoming Enterprise features in Apache HBase 0.96 - Jonathan Hsieh (Cloudera, Inc)
- NOTE: Very nice slides for enterprises that plan to use HBase. They tell you what to prepare and what architecture is required.
- Risk = downtime + data loss
- Production systems need to avoid risk
  - Risks from within the cluster
    - Unplanned maintenance - hardware / software errors - detection time + recovery time
      - automated metadata repairs with hbck (0.92)
      - 0.92/0.94: 180s to detect a Region Server failure; 0.96: 0~1s to detect a Region Server failure
    - Planned maintenance - use NameNode HA + ZooKeeper (ZK) to solve the problem
  - Risks from outside the cluster
    - Amazon power outages; Backhoe: the true cyberthreat!
    - HBase supports batch backups - (1) Export / DistCp / Import (2) CopyTable (off-site backup)
    - HBase replication (0.92+) - (1) Master - Slave (0.90) (2) Master - Master (0.92)
  - Risks from users
    - User error - e.g. drop 'table'
    - Solution 1: user-level security (access control) - based on Kerberos
    - Solution 2: table snapshots (0.96+)
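
The "detection time + recovery time" breakdown noted under unplanned maintenance can be worked through as simple arithmetic. The sketch below uses the detection figures from the notes (180s for 0.92/0.94, and the upper bound of 1s for 0.96); the 60s recovery time is purely an assumption for illustration and was not given in the talk notes.

```python
def unplanned_downtime(detection_s, recovery_s):
    """Downtime for one failure = time to detect it + time to recover."""
    return detection_s + recovery_s


# ASSUMPTION: recovery time is not stated in the notes; 60s is illustrative.
ASSUMED_RECOVERY_S = 60.0

# Detection times taken from the notes above.
downtime_094 = unplanned_downtime(180.0, ASSUMED_RECOVERY_S)
downtime_096 = unplanned_downtime(1.0, ASSUMED_RECOVERY_S)
saved = downtime_094 - downtime_096

print(f"0.92/0.94: {downtime_094:.0f}s, 0.96: {downtime_096:.0f}s, saved: {saved:.0f}s")
```

Under that assumed recovery time, faster detection alone removes most of the downtime per failure, which is why the detection figure is worth calling out.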
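1:40pm Designing Scalable Network Architectures for Fast Moving Big Data - Kenneth Duda (Arista Networks), Amr Awadallah (Cloudera, Inc.)
- How to design the network architecture for a large Hadoop cluster
- (Think: the speaker mentioned the effect of buffers on Hadoop performance, so this matters when tuning)
- ZTP (Zero-Touch Provisioning) - used to control switch configuration .... Hmmm... Cool~ (used at eBay)
- MLAG for High Availability - HA for the network .... Cool feature~
- Fast Server Failover - how to decide from the state of ICMP packets that a server has gone offline. Why wait for the OS to notice? Let the switch tell the connection source directly~
- EOS: common monitoring software (e.g. Ganglia, Nagios, fping) can be installed on the switch
- (Think: this is a question to consider at the very earliest requirements-planning stage. MapReduce and HDFS get considered -> how much compute and storage, but the network is often overlooked -> switch selection and monitoring support. It's all about SCALE!!)
- QoS support - these problems only show up in very large environments
- Impact of OpenFlow (SDN, Software-Defined Networking) on Hadoop environments - Data Locality / Rack Awareness previously had to be configured by hand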
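
As an illustration of the manual rack-awareness setup mentioned in the last point: Hadoop can call an admin-supplied topology script (configured through the `net.topology.script.file.name` property, or `topology.script.file.name` in older releases), passing host addresses as arguments and reading one rack path per address from stdout. The sketch below is a minimal example of such a script; the subnet prefixes and rack names are hypothetical and would have to be maintained by hand for a real site.

```python
import sys

# ASSUMPTION: hypothetical subnet-prefix -> rack mapping, maintained by hand.
RACK_BY_PREFIX = {
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host address, or the default rack."""
    for prefix, rack in RACK_BY_PREFIX.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


def main(argv):
    # One or more addresses arrive as arguments; print one rack path
    # per address, separated by whitespace.
    print(" ".join(resolve_rack(host) for host in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Every new rack or subnet means editing this table, which is the hand-maintained step the talk suggested SDN could take over.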
2:30pm Is Your Cluster a Leaning Tower of Pisa? - Michael Segel (Think Big Analytics)
4:10pm Real-time Big Data Without Streaming - Ron Bodkin (Think Big Analytics)
5:00pm Realtime Processing with Storm - Gabriel Eisbruch (mercadolibre.com), Luis Darío Simonassi (mercadolibre.com), Jonathan Leibiusky (mercadolibre.com)
