wiki:LogParser

Version 11 (modified by waue, 16 years ago)

--

Purpose

This program parses your Apache access logs and stores the parsed fields in HBase.

How to use

  • 1. Upload the Apache logs (/var/log/apache2/access.log*) to HDFS (default: /user/waue/apache-log):
$ bin/hadoop dfs -put /var/log/apache2/ apache-log
  • 2. The "dir" parameter in main is the directory that contains the logs.
  • 3. Filter out irregular entries by hand before running the program,
    ex:  ::1 - - [29/Jun/2008:07:35:15 +0800] "GET / HTTP/1.0" 200 729 "...
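
Step 3 could also be done with a small pre-filter instead of by hand. The sketch below is only an illustration and assumes that an "irregular" entry is one whose first field is not a dotted IPv4 address, as in the ::1 example. LogFilter and keepParsable are hypothetical names that are not part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LogFilter {
    // Assumption: a usable line starts with four dot-separated groups of
    // one to three digits, followed by a space.
    private static final Pattern IPV4_PREFIX =
        Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3} .*");

    // Returns only the lines whose first field looks like an IPv4 address.
    static List<String> keepParsable(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (IPV4_PREFIX.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "::1 - - [29/Jun/2008:07:35:15 +0800] \"GET / HTTP/1.0\" 200 729",
            "140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] "
                + "\"GET /hbase-0.1.3.zip HTTP/1.0\" 200 10249801 \"-\" \"Wget/1.10.2\"");
        // Only the second line survives the filter.
        for (String line : keepParsable(lines)) {
            System.out.println(line);
        }
    }
}
```

The prefix check is deliberately loose (it does not reject values above 255); it only removes lines that clearly cannot be handled.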
    
    

Results

1 Run the following command

	hql > select * from apache-log;

2 Output

+-------------------------+-------------------------+-------------------------+
| Row                     | Column                  | Cell                    |
+-------------------------+-------------------------+-------------------------+
| 118.170.101.250         | http:agent              | Mozilla/4.0 (compatible;|
|                         |                         |  MSIE 4.01; Windows 95) |
+-------------------------+-------------------------+-------------------------+
| 118.170.101.250         | http:bytesize           | 318                     |
+-------------------------+-------------------------+-------------------------+
..........(skip)........
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | http:method             | OPTIONS                 |
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | http:protocol           | HTTP/1.1                |
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | referrer:-              | *                       |
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | url:*                   | -                       |
+-------------------------+-------------------------+-------------------------+

31 row(s) in set. (0.58 sec)


LogParser.java

This Java file parses each line of the log file.

  private static Pattern p = Pattern
  .compile("([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\"" +
                  " ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*");

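First, a java.util.regex.Pattern object is declared. This class has no public constructor, so the object is created with the static method compile(String regex), which builds a pattern object from the regular expression. The Javadoc describes the method as: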

Compiles the given regular expression into a pattern.

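Passing the regular expression as a string argument returns the Pattern object p. The regular expression, written as a Java string literal, is:
([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*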

Suppose an Apache log line looks like this:
140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] "GET /hbase-0.1.3.zip HTTP/1.0" 200 10249801 "-" "Wget/1.10.2"
The regular expression can then be read group by group:

Group  Pattern piece    Meaning        Value in the example
1      ([^ ]*)          ip             140.110.138.176
2      ([^ ]*)          -              -
3      ([^ ]*)          -              -
4      \\[([^]]*)\\]    time           02/Jul/2008:16:55:02 +0800
5      \"([^\"]*)\"     HTTP request   GET /hbase-0.1.3.zip HTTP/1.0
6      ([^ ]*)          status code    200
7      ([^ ]*)          length         10249801
8      \"([^\"]*)\"     referrer       -
9      \"([^\"]*)\"     user agent     Wget/1.10.2

Here Pattern can be thought of as a prototype class: calling compile(expression) produces a template p that follows the rules given by the expression.
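
To make the group numbering concrete, the following minimal sketch compiles the same expression and prints each captured group for the sample log line. PatternDemo and groups are hypothetical names used only for this illustration; they are not part of LogParser.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternDemo {
    // Same expression as in LogParser.java, written as a Java string literal.
    static final Pattern P = Pattern.compile(
        "([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\""
        + " ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*");

    // Returns the captured groups in order, or null when the line does not match.
    static String[] groups(String line) {
        Matcher m = P.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i); // group numbers start at 1
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] "
            + "\"GET /hbase-0.1.3.zip HTTP/1.0\" 200 10249801 \"-\" \"Wget/1.10.2\"";
        String[] g = groups(line);
        for (int i = 0; i < g.length; i++) {
            System.out.println((i + 1) + ": " + g[i]);
        }
    }
}
```

Group 0 (the whole match) is skipped on purpose, so the printed numbers line up with the nine fields of the log line.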


  public LogParser(String line) throws ParseException, Exception {
    Matcher matcher = p.matcher(line);
    // The fields are filled in only when the whole line matches the pattern.
    if (matcher.matches()) {
      // IP address of the client requesting the web page.
      this.ip = matcher.group(1);
      // Lines whose first field is not a dotted IPv4 address are skipped.
      if (isIpAddress(ip)) {
        // Locale.US so that English month names such as "Jul" always parse.
        SimpleDateFormat sdf = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);
        this.timestamp = sdf.parse(matcher.group(4)).getTime();
        // The request field holds "method url protocol", separated by spaces.
        String[] http = matcher.group(5).split(" ");
        this.method = http[0];
        this.url = http[1];
        this.protocol = http[2];
        this.code = matcher.group(6);
        this.byteSize = matcher.group(7);
        this.referrer = matcher.group(8);
        this.agent = matcher.group(9);
      }
    }
  }

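Next the constructor is defined. It declares a java.util.regex.Matcher, an object that works together with the Pattern created earlier.
The template p has a method matcher(String), which presses the material (the input String) into the shape of the template; the resulting object is called matcher. Once matcher.matches() has returned true, the nth piece can be fetched with matcher.group(n), which returns the content of the nth group as a String.
Compare the groups with the fields of the log line that was passed in.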
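
The two conversions inside the constructor (time field to a timestamp, request field to method, URL and protocol) can be tried in isolation. This is a minimal sketch, not the author's code: FieldDemo, toMillis and splitRequest are hypothetical names, and the length check in splitRequest is an added guard that the constructor above does not have.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

class FieldDemo {
    // Converts the bracketed time field (group 4) to milliseconds since the epoch.
    static long toMillis(String time) {
        SimpleDateFormat sdf =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);
        try {
            return sdf.parse(time).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad time field: " + time, e);
        }
    }

    // Splits the request field (group 5) into method, URL and protocol.
    // Added guard (assumption): returns null unless there are exactly three parts.
    static String[] splitRequest(String request) {
        String[] parts = request.split(" ");
        return parts.length == 3 ? parts : null;
    }

    public static void main(String[] args) {
        System.out.println(toMillis("02/Jul/2008:16:55:02 +0800"));
        String[] http = splitRequest("GET /hbase-0.1.3.zip HTTP/1.0");
        System.out.println(http[0] + " | " + http[1] + " | " + http[2]);
    }
}
```

A request field such as "-" splits into a single part, which is why the sketch checks the length before the parts are indexed.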


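  // Returns true when inputString is a dotted IPv4 address such as 140.110.138.176.
  public static boolean isIpAddress(String inputString) {
    StringTokenizer tokenizer = new StringTokenizer(inputString, ".");
    // An IPv4 address has exactly four dot-separated parts.
    if (tokenizer.countTokens() != 4) {
      return false;
    }
    try {
      for (int i = 0; i < 4; i++) {
        String t = tokenizer.nextToken();
        int chunk = Integer.parseInt(t);
        // Masking with 255 changes any value outside 0..255.
        if ((chunk & 255) != chunk) {
          return false;
        }
      }
    } catch (NumberFormatException e) {
      // A part that is not a number.
      return false;
    }
    // StringTokenizer skips empty tokens, so reject consecutive dots explicitly.
    if (inputString.indexOf("..") >= 0) {
      return false;
    }
    return true;
  }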
}
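
To see which inputs isIpAddress accepts, the method can be copied into a small stand-alone class and called with a few values. IpCheckDemo is a hypothetical wrapper used only for this illustration; the method body is the one shown above.

```java
import java.util.StringTokenizer;

class IpCheckDemo {
    // Copy of isIpAddress from LogParser.java.
    static boolean isIpAddress(String inputString) {
        StringTokenizer tokenizer = new StringTokenizer(inputString, ".");
        if (tokenizer.countTokens() != 4) {
            return false;
        }
        try {
            for (int i = 0; i < 4; i++) {
                int chunk = Integer.parseInt(tokenizer.nextToken());
                if ((chunk & 255) != chunk) {
                    return false; // outside 0..255
                }
            }
        } catch (NumberFormatException e) {
            return false;
        }
        return inputString.indexOf("..") < 0;
    }

    public static void main(String[] args) {
        String[] samples = {"140.110.138.176", "::1", "256.1.1.1", "1..2.3.4"};
        for (String s : samples) {
            System.out.println(s + " -> " + isIpAddress(s));
        }
    }
}
```

Only the first sample is accepted: "::1" has no four dot-separated parts, 256 is out of range, and "1..2.3.4" contains consecutive dots.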