= Purpose =
This program parses your Apache logs and stores them in HBase.

= How to use =
 * 1. Upload the Apache logs ( /var/log/apache2/access.log* ) to HDFS (default: /user/waue/apache-log)
{{{
$ bin/hadoop dfs -put /var/log/apache2/ apache-log
}}}
 * 2. The "dir" parameter in main names the directory that contains the logs.
 * 3. Filter out irregular entries manually, for example:
{{{
::1 - - [29/Jun/2008:07:35:15 +0800] "GET / HTTP/1.0" 200 729 "...
}}}

= Result =
1. Run the following command
{{{
hql > select * from apache-log;
}}}
2. Output
{{{
+-------------------------+-------------------------+-------------------------+
| Row                     | Column                  | Cell                    |
+-------------------------+-------------------------+-------------------------+
| 118.170.101.250         | http:agent              | Mozilla/4.0 (compatible;|
|                         |                         | MSIE 4.01; Windows 95)  |
+-------------------------+-------------------------+-------------------------+
| 118.170.101.250         | http:bytesize           | 318                     |
+-------------------------+-------------------------+-------------------------+
..........(skip)........
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | http:method             | OPTIONS                 |
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | http:protocol           | HTTP/1.1                |
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | referrer:-              | *                       |
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | url:*                   | -                       |
+-------------------------+-------------------------+-------------------------+
31 row(s) in set.
(0.58 sec)
}}}

= LogParser.java =
This Java file parses each line of the log file.
{{{
private static Pattern p = Pattern
    .compile("([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\""
        + " ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*");
}}}
First, a [http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html java.util.regex.Pattern] object is declared. This class has no public constructor, so the object is created with the static method compile(String regex), which is documented as:[[br]]
Compiles the given regular expression into a pattern.[[br]]
Passing the regular expression string as the argument yields the Pattern object p. The regular expression, shown here in its Java string-literal form, is:[[br]]
'''([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*''' [[br]]
It matches Apache log lines of this format:[[br]]
''140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] "GET /hbase-0.1.3.zip HTTP/1.0" 200 10249801 "-" "Wget/1.10.2"''[[br]]
Here Pattern can be thought of as a prototype class: calling compile(expression) produces a template p that follows the rules of that expression.
----------------------------
{{{
public LogParser(String line) throws ParseException, Exception {
    // Apply the compiled template to the whole log line.
    Matcher matcher = p.matcher(line);
    if (matcher.matches()) {
        this.ip = matcher.group(1); // IP address of the client requesting the web page.
        if (isIpAddress(ip)) {
            // Group 4 is the timestamp without the surrounding brackets.
            SimpleDateFormat sdf = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);
            this.timestamp = sdf.parse(matcher.group(4)).getTime();
            // Group 5 is the request line; this assumes the form "METHOD URL PROTOCOL".
            String[] http = matcher.group(5).split(" ");
            this.method = http[0];
            this.url = http[1];
            this.protocol = http[2];
            this.code = matcher.group(6);
            this.byteSize = matcher.group(7);
            this.referrer = matcher.group(8);
            this.agent = matcher.group(9);
        }
    }
}
}}}
Next comes the constructor, which declares a [http://java.sun.com/javase/6/docs/api/java/util/regex/Matcher.html java.util.regex.Matcher]; this object works together with the Pattern above.[[br]]
The template p has a method matcher(String) that presses the material (the input String) into the shape of the template, and the resulting object is called matcher. To retrieve the n-th segment afterwards, call matcher.group(n), which returns the content of the n-th segment as a String.[[br]]
Comparing the groups against the input line:
|| 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 ||
|| ip || - || - || time || "http" || status code || size || "referrer" || "agent" ||
|| 140.110.138.176 || - || - || [02/Jul/2008:16:55:02 +0800] || "GET /hbase-0.1.3.zip HTTP/1.0" || 200 || 10249801 || "-" || "Wget/1.10.2" ||
[[br]]
{{{
public static boolean isIpAddress(String inputString) {
    // A dotted IPv4 address has exactly four parts.
    StringTokenizer tokenizer = new StringTokenizer(inputString, ".");
    if (tokenizer.countTokens() != 4) {
        return false;
    }
    try {
        for (int i = 0; i < 4; i++) {
            String t = tokenizer.nextToken();
            int chunk = Integer.parseInt(t);
            // Each part must lie in the range 0..255.
            if ((chunk & 255) != chunk) {
                return false;
            }
        }
    } catch (NumberFormatException e) {
        return false;
    }
    // StringTokenizer skips empty tokens, so reject consecutive dots explicitly.
    if (inputString.indexOf("..") >= 0) {
        return false;
    }
    return true;
}
}
}}}
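
The matching step described above can be tried in isolation. The following is a minimal sketch, not part of LogParser.java: it applies the same regular expression to the sample log line and returns the captured groups. The class name LogLineDemo and the helpers groups and toMillis are hypothetical names introduced only for this illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regular expression and the date format
// are taken from LogParser.java.
class LogLineDemo {
    private static final Pattern P = Pattern.compile(
            "([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\""
            + " ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*");

    // Returns capture groups 1..9 (at indexes 0..8), or null if the line does not match.
    static String[] groups(String line) {
        Matcher m = P.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    // Converts the captured timestamp (group 4) to milliseconds since the epoch.
    static long toMillis(String timestamp) {
        SimpleDateFormat sdf = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);
        try {
            return sdf.parse(timestamp).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        String line = "140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] "
                + "\"GET /hbase-0.1.3.zip HTTP/1.0\" 200 10249801 \"-\" \"Wget/1.10.2\"";
        String[] g = groups(line);
        for (int i = 0; i < g.length; i++) {
            System.out.println((i + 1) + ": " + g[i]);
        }
        System.out.println("timestamp (ms): " + toMillis(g[3]));
    }
}
```

Running main prints the nine groups in the same order as the table above. Note that group 4 is captured without the square brackets and groups 5, 8 and 9 without the double quotes, because those characters sit outside the parentheses in the expression.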