build-web-application-with-…/7.3.md

#7.3 正则处理
正则表达式是进行模式匹配和文本操纵的一种复杂而强大的工具。虽然正则表达式没有纯粹的文本匹配速度那么快，但应用起来相当的灵活。正则表达式通过简单的语法(一些简单的符号)构造模式能够匹配几乎任何可以想得到的字符组合。如果你在Web开发中需要从一些文本数据源中获取数据，那么正则表达式就能够帮你从这些数据源中提取出有意义的信息。

Go语言标准包里面已经包含有`regexp`，实现了正则表达式的搜索匹配，接受和python、perl或者其他语言一样的正则表达式语法，更准确的说，它实现了RE2标准，除了`\C`，详细的语法描述参考：http://code.google.com/p/re2/wiki/Syntax 如果你在其他语言里面使用过正则，Go实现的正则语法基本都一致，那么只需要了解一下`regexp`包里面的一些函数参数就可以了。

其实字符串处理我们可以使用`strings`包来进行搜索(Contains、Index)、替换(Replace)和解析(Split、Join)等操作，但是这些都是简单的字符串操作，他们的搜索都是大小写敏感，而且固定的字符串，如果我们需要匹配可变的那种就没办法实现了，当然如果`strings`包能解决你的问题，那么就尽量使用它来解决。因为他们足够简单、而且性能和可读性都会比正则好。

我们在前面表单验证的小节里面已经接触过正则处理，我们利用了正则表达式来匹配输入的信息是否和相应的格式匹配。在使用中我们需要注意一点，所有的字符都是UTF-8编码的。接下来让我们更加深入的来理解Go语言的`regexp`包。

##通过正则判断是否匹配
`regexp`包中含有三个函数用来判断是否匹配，如果匹配返回true，否则返回false

	func Match(pattern string, b []byte) (matched bool, error error)
	func MatchReader(pattern string, r io.RuneReader) (matched bool, error error)
	func MatchString(pattern string, s string) (matched bool, error error)

上面的三个函数实现了同一个功能，就是判断`pattern`是否和输入源匹配，匹配的话就返回true，如果解析正则出错则返回error。三个函数的输入源分别是byte slice、RuneReader和string。

如果要验证一个输入是不是IP地址，那么如何来判断呢？请看如下实现

	func IsIP(ip string) (b bool) {
		if m, _ := regexp.MatchString("^[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}$", ip); !m {
			return false
		}
		return true
	}

我们可以看到，`regexp`的pattern和我们平常使用的正则一模一样。接下来我们再来看一个例子，用户输入一个字符串，我们想知道是不是合法输入：

	func main() {
  		if len(os.Args) == 1 {
    		fmt.Println("Usage: regexp [string]")
    		os.Exit(1)
  		} else if m, _ := regexp.MatchString("^[0-9]+$", os.Args[1]); m {
    		fmt.Println("数字")
  		} else {
    		fmt.Println("不是数字")
  		}
	}

通过上面的一些例子，我们了解到如果要判断一些字符串是否符合我们的描述需求，我们采用Match(Reader|String)来进行判断，这个使用非常方便。

##通过正则获取内容
我们发现采用上面Match模式只能用来作为做字符串的判断，而无法截取字符串的一部分，或者过滤字符串、或者符合条件的一批字符串。如果想要使用这些，那就需要使用正则表达式的复杂模式。

我们经常需要一些爬虫程序，下面我们就以爬虫举例来说明如何过滤和截取抓取的数据：

	package main

	import (
		"fmt"
		"io/ioutil"
		"net/http"
		"regexp"
		"strings"
	)

	func main() {
		resp, err := http.Get("http://www.baidu.com")
		if err != nil {
			fmt.Println("http get error.")
		}
		defer resp.Body.Close()
		body, err := ioutil.ReadAll(resp.Body)
		if err != nil {
			fmt.Println("http read error")
		}

		src := string(body)

		//将HTML标签全转换成小写
		re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
		src = re.ReplaceAllStringFunc(src, strings.ToLower)

		//去除STYLE
		re, _ = regexp.Compile("\\<style[\\S\\s]+?\\</style\\>")
		src = re.ReplaceAllString(src, "")

		//去除SCRIPT
		re, _ = regexp.Compile("\\<script[\\S\\s]+?\\</script\\>")
		src = re.ReplaceAllString(src, "")

		//去除所有尖括号内的HTML代码，并换成换行符
		re, _ = regexp.Compile("\\<[\\S\\s]+?\\>")
		src = re.ReplaceAllString(src, "\n")

		//去除连续的换行符
		re, _ = regexp.Compile("\\s{2,}")
		src = re.ReplaceAllString(src, "\n")

		fmt.Println(strings.TrimSpace(src))
	}

上面的演示代码我们可以看出来，使用复杂的正则是首先调用Compile，Compile会解析正则表达式是否合法，如果正确，那么就会返回一个Regexp，然后可以在任意的字符串上面应用相应的函数进行过滤、截取等操作。

解析正则表达式的有如下几个方法：

	func Compile(expr string) (*Regexp, error)
	func CompilePOSIX(expr string) (*Regexp, error)
	func MustCompile(str string) *Regexp
	func MustCompilePOSIX(str string) *Regexp

Compile和CompilePOSIX不同就是POSIX必须使用POSIX语法，然后他使用最左最长方式搜索，而Compile是采用最左方式搜索(例如[a-z]{2,4}这样正则在搜索的时候可能搜索到任意的字母组合，那么他搜索满足条件的最左边最长的那个匹配，"aa09aaa88aaaa"，当搜索这个字符串的时候，应该是返回aaaa，而不是第一个匹配的，而Compile的返回aa，最左边最先匹配的字符串)。前缀有Must的表示，在解析正则语法的时候，调用该函数如果解析正则语法出错直接抛出panic，而不加的则返回错误。

在了解了如何新建一个Regexp之后，我们来看看这个struct有哪些方法可以让我们来操作字符串，首先我们来看下面这写用来搜索的函数：

	func (re *Regexp) Find(b []byte) []byte
    func (re *Regexp) FindAll(b []byte, n int) [][]byte
    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
    func (re *Regexp) FindAllString(s string, n int) []string
    func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
    func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
    func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
    func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
    func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
    func (re *Regexp) FindIndex(b []byte) (loc []int)
    func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
    func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
    func (re *Regexp) FindString(s string) string
    func (re *Regexp) FindStringIndex(s string) (loc []int)
    func (re *Regexp) FindStringSubmatch(s string) []string
    func (re *Regexp) FindStringSubmatchIndex(s string) []int
    func (re *Regexp) FindSubmatch(b []byte) [][]byte
    func (re *Regexp) FindSubmatchIndex(b []byte) []int

上面这18个函数我们根据输入源(byte slice、string和io.RuneReader)不同还可以继续简化成如下几个，其他的只是输入源不一样，其他功能基本是一样的：

	func (re *Regexp) Find(b []byte) []byte
    func (re *Regexp) FindAll(b []byte, n int) [][]byte
    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
	func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
    func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
	func (re *Regexp) FindIndex(b []byte) (loc []int)
	func (re *Regexp) FindSubmatch(b []byte) [][]byte
    func (re *Regexp) FindSubmatchIndex(b []byte) []int

对于这些函数的使用我们来看下面这个例子

	package main

	import (
		"fmt"
		"regexp"
	)

	func main() {
		a := "I am learning Go language"

		re, _ := regexp.Compile("[a-z]{2,4}")

		//查找符合正则的第一个
		one := re.Find([]byte(a))
		fmt.Println("Find:", string(one))

		//查找符合正则的所有slice,n小于0表示返回全部符合的字符串，不然就是返回指定的长度
		all := re.FindAll([]byte(a), -1)
		fmt.Println("FindAll", all)

		//查找符合条件的index位置,开始位置和结束位置
		index := re.FindIndex([]byte(a))
		fmt.Println("FindIndex", index)

		//查找符合条件的所有的index位置，n同上
		allindex := re.FindAllIndex([]byte(a), -1)
		fmt.Println("FindAllIndex", allindex)

		re2, _ := regexp.Compile("am(.*)lang(.*)")

		//查找Submatch,返回数组，第一个元素是匹配的全部元素，第二个元素是第一个()里面的，第三个是第二个()里面的
		//下面的输出第一个元素是"am learning Go language"
		//第二个元素是" learning Go "，注意包含空格的输出
		//第三个元素是"uage"
		submatch := re2.FindSubmatch([]byte(a))
		fmt.Println("FindSubmatch", submatch)
		for _, v := range submatch {
			fmt.Println(string(v))
		}

		//定义和上面的FindIndex一样
		submatchindex := re2.FindSubmatchIndex([]byte(a))
		fmt.Println(submatchindex)

		//FindAllSubmatch,查找所有符合条件的子匹配
		submatchall := re2.FindAllSubmatch([]byte(a), -1)
		fmt.Println(submatchall)

		//FindAllSubmatchIndex,查找所有字匹配的index
		submatchallindex := re2.FindAllSubmatchIndex([]byte(a), -1)
		fmt.Println(submatchallindex)
	}

我们前面介绍过匹配函数，那么Regexp对象也有这三个函数，和上面外部的三个函数功能一模一样，上面三个外部的函数其实内部实现就是调用了这三个函数：

	func (re *Regexp) Match(b []byte) bool
    func (re *Regexp) MatchReader(r io.RuneReader) bool
    func (re *Regexp) MatchString(s string) bool

接下里让我们来了解替换函数是怎么操作的？

	func (re *Regexp) ReplaceAll(src, repl []byte) []byte
    func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
    func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
    func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
    func (re *Regexp) ReplaceAllString(src, repl string) string
    func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

这些替换函数我们在上面的抓网页的例子有详细应用示例，

接下来我们看一下Expand的解释：

	func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
    func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte

那么这个Expand到底用来干嘛的呢？请看下面的例子：

	func main() {
		src := []byte(`
			call hello alice
			hello bob
			call hello eve
		`)
		pat := regexp.MustCompile(`(?m)(call)\s+(?P<cmd>\w+)\s+(?P<arg>.+)\s*$`)
		res := []byte{}
		for _, s := range pat.FindAllSubmatchIndex(src, -1) {
			res = pat.Expand(res, []byte("$cmd('$arg')\n"), src, s)
		}
		fmt.Println(string(res))
	}

至此我们已经全部介绍完Go语言的`regexp`包，通过上面对他的每个函数的介绍以及通过例子来演示，大家应该能够通过Go语言的正则包进行基本一些正则的操作了。


## links
   * [目录](<preface.md>)
   * 上一节: [Json处理](<7.2.md>)
   * 下一节: [模板处理](<7.4.md>)

## LastModified
   * $Id$