= 2012-10-24 =

== Hadoop World 2012 (Keynotes) ==

 * '''Big Answers''' - Mike Olson (Cloudera)
   * http://cloudera.com/impla - Apache License, real-time query engine
   * Applications: agriculture, Social Security
 * '''The End of the Data Warehouse''' - Ben Werther (Platfora)
   * Adds a data warehouse layer between the web interface and Hadoop
   * In-memory BI
 * '''Moneyball for New York City''' - Michael Flowers (NYC Mayor's Office of Policy and Strategic Planning)
   * Introduced the "Data Sensing Lab" in the venue (based on Arduino)
   * '''Smart City''' - how to use NYC data more effectively
   * Fire Risk Inspection Analytics - predicts how high the fire risk is -> the risk was effectively reduced
   * Staff around 25 years old; started with 2005-era PCs and MS Excel
   * Use the Web (Wall) for collaborative editing
   * Grow and enable a data-driven culture
 * '''Thinking Big Together: Driving the Future of Data Science''' - Annika Jimenez (EMC Greenplum), Anthony Goldbloom (Kaggle)
   * Data, Technology, Application, Data Science
   * http://openchorus.org
   * Kaggle - professional experts are in short supply. At every conference companies announce that they are hiring, which also shows how scarce talent is in this field.
   * http://greenplum.com/communities
 * '''The Composite Database''' - Rich Hickey (Datomic)
   * Big Data magazine -
   * Strata - 9:1 -> ratio of submissions to acceptances
   * Break down the traditional database: storage, index, query
   * Indexing as a component: Storage -> Indexing -> Ordered Storage
   * Query as a component:
   * Coordination as a component
   * Other components: notification, memory indexes (liveness), caching
 * '''The Democratization of Big Data: Bringing Hadoop to the Masses''' - James Markarian (Informatica)
   * The molecule-style diagram illustrates how the parts of the Hadoop ecosystem relate to each other
   * The coding model works, but the Hadoop ecosystem needs a friendly interface
 * '''Big Data Direct – The Era of Self-driven Big Data Exploration''' - Sharmila Shahani-Mulligan (!ClearStory Data)
   * Ever-increasing complexity of data sources - growing number of open data APIs
   * Mixing private data with public data can yield many insights - several cases were shown (e.g. water pipes, website SEO)
   * $35 billion industry for databases and data containers
   * Future: a new era of analyzing big data easily and balancing it with judgment and experience
   * Solution Must Aid Human Insight - Big Data + Amplified Human Intelligence
   * Next Generation Visualization - Only One Part of the Answer (a single chart only explains one part of an answer)
 * '''Bringing the 'So What' to Big Data''' - Tim Estes (Digital Reasoning)
   * Machine Reading problem
   * It's about people understanding data to change and improve their lives
   * Big understanding > Big Data
   * Key: Machine Learning, Bare Metal, Application
   * People > Data
   * Ex. Analysis in support of national security, safeguarding freedom
   * Ex. Identifying human trafficking rings from advertisements and rescuing abducted underage girls
   * Ex. Reducing financial risk
   * Mission > Consumerism
   * Three key missions: (1) Security and Freedom of World (2) (3)

== Hadoop World 2012 (Sessions) ==

 * 10:50am Wednesday, 10/24/2012
   * '''MapReduce Design Patterns''', Donald Miner (EMC Greenplum)
     * The book looks like a good programming reference guide, especially because it also presents the characteristics of each pattern in SQL or Pig syntax. That helps a lot with cross-team communication.
 * 11:40am Wednesday, 10/24/2012
   * '''Analyzing Millions of GitHub Commits: What Makes Developers Happy, Angry, and Everything in Between?''', Ilya Grigorik (Google), Brian Doll (GitHub)
     * Ilya is a developer on Google BigQuery (cool~); Brian works in marketing at GitHub
     * (Ilya)
       * Motivation: when you follow too many projects, it is hard to see the whole picture! '''Global Timeline''' -> only if we can access the GitHub archive
       * http://githubarchive.org - data starts from March 2012
       * (Think: in the future, a person's activity on GitHub may become a reference for headhunters judging that person's programming ability -> a way for programmers to market themselves)
       * Dremel (Paper) -> [http://www.google.com/bigquery BigQuery]
       {{{
# Download one hourly archive of the GitHub timeline
$ wget http://data.githubarchive.org/2012-04-11-15.json.gz
# Flatten the JSON events into CSV
$ ruby flatten.rb 2012-04-11-15.json.gz > flat.csv.gz
# Load the CSV into the github.timeline table with the bq CLI
$ bq load github.timeline flat.csv.gz
       }}}
       * BigQuery uses a SQL-like syntax
       * (Think: BigQuery could be used for some OpenData.TW applications)
       * GitHub + BigQuery + MailChimp -> run BigQuery every day from crontab
     * GitHub Data Challenge (Brian)
       * Ex. http://octoboard.com
       * (Think: can an analysis of overall GitHub behavior represent the characteristics of global open source activity? Ex. Private vs Public)
       * Emotional comments -> emotional language written in comments
       * A rather interesting analysis - programming language associations
       * Association analysis - popular languages on GitHub vs. number of StackOverflow questions
       * Activity by country - commits per 100k people (number of commits / population)
 * 1:40pm Wednesday, 10/24/2012
   * '''Facebook’s Large Scale Monitoring System Built on HBase''' - Liyin Tang (Facebook), Vinod Venkataraman (Facebook), Charles Thayer (Facebook)
     * Facebook ODS
     * Problem: (1) MySQL table size limitations (2) Sharding scheme created hotspots (3) Data growth
     * Lesson learned - Locality: splitting HBase and HDFS is not good!
     * Pitfalls
     * HBase performance tuning - Takeaways 1~5
 * 2:30pm Wednesday, 10/24/2012
   * '''Bringing Real-Time, End-to-End Analytics into Everyday Use''', Greg Khairallah (Intel), Vin Sharma (Intel)
     *