Merging other languages

2016-09-23 18:01:10 -03:00
parent 380a8ee74c
commit de3c5bdaa4
490 changed files with 24539 additions and 24588 deletions
--- a/en/07.3.md
+++ b/en/07.3.md
@@ -1,237 +1,241 @@
-# 7.3 正则处理
-正则表达式是一种进行模式匹配和文本操纵的复杂而又强大的工具。虽然正则表达式比纯粹的文本匹配效率低，但是它却更灵活。按照它的语法规则，随需构造出的匹配模式就能够从原始文本中筛选出几乎任何你想要得到的字符组合。如果你在Web开发中需要从一些文本数据源中获取数据,那么你只需要按照它的语法规则，随需构造出正确的模式字符串就能够从原数据源提取出有意义的文本信息。
-
-Go语言通过`regexp`标准包为正则表达式提供了官方支持，如果你已经使用过其他编程语言提供的正则相关功能，那么你应该对Go语言版本的不会太陌生，但是它们之间也有一些小的差异，因为Go实现的是RE2标准，除了\C，详细的语法描述参考：`http://code.google.com/p/re2/wiki/Syntax`
-
-其实字符串处理我们可以使用`strings`包来进行搜索(Contains、Index)、替换(Replace)和解析(Split、Join)等操作，但是这些都是简单的字符串操作，他们的搜索都是大小写敏感，而且固定的字符串，如果我们需要匹配可变的那种就没办法实现了，当然如果`strings`包能解决你的问题，那么就尽量使用它来解决。因为他们足够简单、而且性能和可读性都会比正则好。
-
-如果你还记得，在前面表单验证的小节里，我们已经接触过正则处理，在那里我们利用了它来验证输入的信息是否满足某些预设的条件。在使用中需要注意的一点就是：所有的字符都是UTF-8编码的。接下来让我们更加深入的来学习Go语言的`regexp`包相关知识吧。
-
-## 通过正则判断是否匹配
-`regexp`包中含有三个函数用来判断是否匹配，如果匹配返回true，否则返回false
-
-	func Match(pattern string, b []byte) (matched bool, error error)
-	func MatchReader(pattern string, r io.RuneReader) (matched bool, error error)
-	func MatchString(pattern string, s string) (matched bool, error error)
-
-上面的三个函数实现了同一个功能，就是判断`pattern`是否和输入源匹配，匹配的话就返回true，如果解析正则出错则返回error。三个函数的输入源分别是byte slice、RuneReader和string。
-
-如果要验证一个输入是不是IP地址，那么如何来判断呢？请看如下实现
-
-	func IsIP(ip string) (b bool) {
-		if m, _ := regexp.MatchString("^[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}$", ip); !m {
-			return false
-		}
-		return true
-	}
-
-可以看到，`regexp`的pattern和我们平常使用的正则一模一样。再来看一个例子：当用户输入一个字符串，我们想知道是不是一次合法的输入：
-
-	func main() {
-		if len(os.Args) == 1 {
-			fmt.Println("Usage: regexp [string]")
-			os.Exit(1)
-		} else if m, _ := regexp.MatchString("^[0-9]+$", os.Args[1]); m {
-			fmt.Println("数字")
-		} else {
-			fmt.Println("不是数字")
-		}
-	}
-
-在上面的两个小例子中，我们采用了Match(Reader|String)来判断一些字符串是否符合我们的描述需求，它们使用起来非常方便。
-
-## 通过正则获取内容
-Match模式只能用来对字符串的判断，而无法截取字符串的一部分、过滤字符串、或者提取出符合条件的一批字符串。如果想要满足这些需求，那就需要使用正则表达式的复杂模式。
-
-我们经常需要一些爬虫程序，下面就以爬虫为例来说明如何使用正则来过滤或截取抓取到的数据：
-
-	package main
-
-	import (
-		"fmt"
-		"io/ioutil"
-		"net/http"
-		"regexp"
-		"strings"
-	)
-
-	func main() {
-		resp, err := http.Get("http://www.baidu.com")
-		if err != nil {
-			fmt.Println("http get error.")
-		}
-		defer resp.Body.Close()
-		body, err := ioutil.ReadAll(resp.Body)
-		if err != nil {
-			fmt.Println("http read error")
-			return
-		}
-
-		src := string(body)
-
-		//将HTML标签全转换成小写
-		re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
-		src = re.ReplaceAllStringFunc(src, strings.ToLower)
-
-		//去除STYLE
-		re, _ = regexp.Compile("\\<style[\\S\\s]+?\\</style\\>")
-		src = re.ReplaceAllString(src, "")
-
-		//去除SCRIPT
-		re, _ = regexp.Compile("\\<script[\\S\\s]+?\\</script\\>")
-		src = re.ReplaceAllString(src, "")
-
-		//去除所有尖括号内的HTML代码，并换成换行符
-		re, _ = regexp.Compile("\\<[\\S\\s]+?\\>")
-		src = re.ReplaceAllString(src, "\n")
-
-		//去除连续的换行符
-		re, _ = regexp.Compile("\\s{2,}")
-		src = re.ReplaceAllString(src, "\n")
-
-		fmt.Println(strings.TrimSpace(src))
-	}
-
-从这个示例可以看出，使用复杂的正则首先是Compile，它会解析正则表达式是否合法，如果正确，那么就会返回一个Regexp，然后就可以利用返回的Regexp在任意的字符串上面执行需要的操作。
-
-解析正则表达式的有如下几个方法：
-
-	func Compile(expr string) (*Regexp, error)
-	func CompilePOSIX(expr string) (*Regexp, error)
-	func MustCompile(str string) *Regexp
-	func MustCompilePOSIX(str string) *Regexp
-
-CompilePOSIX和Compile的不同点在于POSIX必须使用POSIX语法，它使用最左最长方式搜索，而Compile是采用的则只采用最左方式搜索(例如[a-z]{2,4}这样一个正则表达式，应用于"aa09aaa88aaaa"这个文本串时，CompilePOSIX返回了aaaa，而Compile的返回的是aa)。前缀有Must的函数表示，在解析正则语法的时候，如果匹配模式串不满足正确的语法则直接panic，而不加Must的则只是返回错误。
-
-在了解了如何新建一个Regexp之后，我们再来看一下这个struct提供了哪些方法来辅助我们操作字符串，首先我们来看下面这些用来搜索的函数：
-
-	func (re *Regexp) Find(b []byte) []byte
-	func (re *Regexp) FindAll(b []byte, n int) [][]byte
-	func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
-	func (re *Regexp) FindAllString(s string, n int) []string
-	func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
-	func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
-	func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
-	func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
-	func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
-	func (re *Regexp) FindIndex(b []byte) (loc []int)
-	func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
-	func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
-	func (re *Regexp) FindString(s string) string
-	func (re *Regexp) FindStringIndex(s string) (loc []int)
-	func (re *Regexp) FindStringSubmatch(s string) []string
-	func (re *Regexp) FindStringSubmatchIndex(s string) []int
-	func (re *Regexp) FindSubmatch(b []byte) [][]byte
-	func (re *Regexp) FindSubmatchIndex(b []byte) []int
-
-上面这18个函数我们根据输入源(byte slice、string和io.RuneReader)不同还可以继续简化成如下几个，其他的只是输入源不一样，其他功能基本是一样的：
-
-	func (re *Regexp) Find(b []byte) []byte
-	func (re *Regexp) FindAll(b []byte, n int) [][]byte
-	func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
-	func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
-	func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
-	func (re *Regexp) FindIndex(b []byte) (loc []int)
-	func (re *Regexp) FindSubmatch(b []byte) [][]byte
-	func (re *Regexp) FindSubmatchIndex(b []byte) []int
-
-对于这些函数的使用我们来看下面这个例子
-
-	package main
-
-	import (
-		"fmt"
-		"regexp"
-	)
-
-	func main() {
-		a := "I am learning Go language"
-
-		re, _ := regexp.Compile("[a-z]{2,4}")
-
-		//查找符合正则的第一个
-		one := re.Find([]byte(a))
-		fmt.Println("Find:", string(one))
-
-		//查找符合正则的所有slice,n小于0表示返回全部符合的字符串，不然就是返回指定的长度
-		all := re.FindAll([]byte(a), -1)
-		fmt.Println("FindAll", all)
-
-		//查找符合条件的index位置,开始位置和结束位置
-		index := re.FindIndex([]byte(a))
-		fmt.Println("FindIndex", index)
-
-		//查找符合条件的所有的index位置，n同上
-		allindex := re.FindAllIndex([]byte(a), -1)
-		fmt.Println("FindAllIndex", allindex)
-
-		re2, _ := regexp.Compile("am(.*)lang(.*)")
-
-		//查找Submatch,返回数组，第一个元素是匹配的全部元素，第二个元素是第一个()里面的，第三个是第二个()里面的
-		//下面的输出第一个元素是"am learning Go language"
-		//第二个元素是" learning Go "，注意包含空格的输出
-		//第三个元素是"uage"
-		submatch := re2.FindSubmatch([]byte(a))
-		fmt.Println("FindSubmatch", submatch)
-		for _, v := range submatch {
-			fmt.Println(string(v))
-		}
-
-		//定义和上面的FindIndex一样
-		submatchindex := re2.FindSubmatchIndex([]byte(a))
-		fmt.Println(submatchindex)
-
-		//FindAllSubmatch,查找所有符合条件的子匹配
-		submatchall := re2.FindAllSubmatch([]byte(a), -1)
-		fmt.Println(submatchall)
-
-		//FindAllSubmatchIndex,查找所有字匹配的index
-		submatchallindex := re2.FindAllSubmatchIndex([]byte(a), -1)
-		fmt.Println(submatchallindex)
-	}
-
-前面介绍过匹配函数，Regexp也定义了三个函数，它们和同名的外部函数功能一模一样，其实外部函数就是调用了这Regexp的三个函数来实现的：
-
-	func (re *Regexp) Match(b []byte) bool
-	func (re *Regexp) MatchReader(r io.RuneReader) bool
-	func (re *Regexp) MatchString(s string) bool
-
-接下里让我们来了解替换函数是怎么操作的？
-
-	func (re *Regexp) ReplaceAll(src, repl []byte) []byte
-	func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
-	func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
-	func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
-	func (re *Regexp) ReplaceAllString(src, repl string) string
-	func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string
-
-这些替换函数我们在上面的抓网页的例子有详细应用示例，
-
-接下来我们看一下Expand的解释：
-
-	func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
-	func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte
-
-那么这个Expand到底用来干嘛的呢？请看下面的例子：
-
-	func main() {
-		src := []byte(`
-			call hello alice
-			hello bob
-			call hello eve
-		`)
-		pat := regexp.MustCompile(`(?m)(call)\s+(?P<cmd>\w+)\s+(?P<arg>.+)\s*$`)
-		res := []byte{}
-		for _, s := range pat.FindAllSubmatchIndex(src, -1) {
-			res = pat.Expand(res, []byte("$cmd('$arg')\n"), src, s)
-		}
-		fmt.Println(string(res))
-	}
-
-至此我们已经全部介绍完Go语言的`regexp`包，通过对它的主要函数介绍及演示，相信大家应该能够通过Go语言的正则包进行一些基本的正则的操作了。
-
-
-## links
-   * [目录](<preface.md>)
-   * 上一节: [Json处理](<07.2.md>)
-   * 下一节: [模板处理](<07.4.md>)
+# 7.3 Regexp
+
+Regexp is a complicated but powerful tool for pattern matching and text manipulation. Although does not perform as well as pure text matching, it's more flexible. Based on its syntax, you can filter almost any kind of text from your source content. If you need to collect data in web development, it's not hard to use Regexp to retrieve meaningful data.
+
+Go has the `regexp` package, which provides official support for regexp. If you've already used regexp in other programming languages, you should be familiar with it. Note that Go implemented RE2 standard except for `\C`. For more details, follow this link: [http://code.google.com/p/re2/wiki/Syntax](http://code.google.com/p/re2/wiki/Syntax).
+
+Go's `strings` package can actually do many jobs like searching (Contains, Index), replacing (Replace), parsing (Split, Join), etc., and it's faster than Regexp. However, these are all trivial operations. If you want to search a case insensitive string, Regexp should be your best choice. So, if the `strings` package is sufficient for your needs, just use it since it's easy to use and read; if you need to perform more advanced operations, use Regexp.
+
+If you recall form verification from previous sections, we used Regexp to verify the validity of user input information. Be aware that all characters are UTF-8. Let's learn more about the Go `regexp` package!
+
+## Match
+
+The `regexp` package has 3 functions to match: if it matches a pattern, then it returns true, returning false otherwise.
+
+	func Match(pattern string, b []byte) (matched bool, error error)
+	func MatchReader(pattern string, r io.RuneReader) (matched bool, error error)
+	func MatchString(pattern string, s string) (matched bool, error error)
+
+All of 3 functions check if `pattern` matches the input source, returning true if it matches. However if your Regex has syntax errors, it will return an error. The 3 input sources of these functions are `slice of byte`, `RuneReader` and `string`.
+
+Here is an example of how to verify an IP address:
+
+	func IsIP(ip string) (b bool) {
+		if m, _ := regexp.MatchString("^[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}$", ip); !m {
+			return false
+		}
+		return true
+	}
+
+As you can see, using pattern in the `regexp` package is not that different. Here's one  more example on verifying if user input is valid:
+
+	func main() {
+		if len(os.Args) == 1 {
+			fmt.Println("Usage: regexp [string]")
+			os.Exit(1)
+		} else if m, _ := regexp.MatchString("^[0-9]+$", os.Args[1]); m {
+			fmt.Println("Number")
+		} else {
+			fmt.Println("Not number")
+		}
+	}
+
+In the above examples, we use `Match(Reader|Sting)` to check if content is valid, but they are all easy to use.
+
+## Filter
+
+Match mode can verify content but it cannot cut, filter or collect data from it. If you want to do that, you have to use complex mode of Regexp.
+
+Let's say we need to write a crawler. Here is an example that shows when you must use Regexp to filter and cut data.
+
+	package main
+
+	import (
+		"fmt"
+		"io/ioutil"
+		"net/http"
+		"regexp"
+		"strings"
+	)
+
+	func main() {
+		resp, err := http.Get("http://www.baidu.com")
+		if err != nil {
+			fmt.Println("http get error.")
+		}
+		defer resp.Body.Close()
+		body, err := ioutil.ReadAll(resp.Body)
+		if err != nil {
+			fmt.Println("http read error")
+			return
+		}
+
+		src := string(body)
+
+		// Convert HTML tags to lower case.
+		re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
+		src = re.ReplaceAllStringFunc(src, strings.ToLower)
+
+		// Remove STYLE.
+		re, _ = regexp.Compile("\\<style[\\S\\s]+?\\</style\\>")
+		src = re.ReplaceAllString(src, "")
+
+		// Remove SCRIPT.
+		re, _ = regexp.Compile("\\<script[\\S\\s]+?\\</script\\>")
+		src = re.ReplaceAllString(src, "")
+
+		// Remove all HTML code in angle brackets, and replace with newline.
+		re, _ = regexp.Compile("\\<[\\S\\s]+?\\>")
+		src = re.ReplaceAllString(src, "\n")
+
+		// Remove continuous newline.
+		re, _ = regexp.Compile("\\s{2,}")
+		src = re.ReplaceAllString(src, "\n")
+
+		fmt.Println(strings.TrimSpace(src))
+	}
+
+In this example, we use Compile as the first step for complex mode. It verifies that your Regex syntax is correct, then returns a `Regexp` for parsing content in other operations.
+
+Here are some functions to parse your Regexp syntax:
+
+	func Compile(expr string) (*Regexp, error)
+	func CompilePOSIX(expr string) (*Regexp, error)
+	func MustCompile(str string) *Regexp
+	func MustCompilePOSIX(str string) *Regexp
+
+The difference between `ComplePOSIX` and `Compile` is that the former has to use POSIX syntax which is leftmost longest search, and the latter is only leftmost search. For instance, for Regexp `[a-z]{2,4}` and content `"aa09aaa88aaaa"`, `CompilePOSIX` returns `aaaa` but `Compile` returns `aa`. `Must` prefix means panic when the Regexp syntax is not correct, returning error otherwise.
+
+Now that we know how to create a new Regexp, let's see what how the methods provided by this struct can help us to operate on content:
+
+	func (re *Regexp) Find(b []byte) []byte
+	func (re *Regexp) FindAll(b []byte, n int) [][]byte
+	func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
+	func (re *Regexp) FindAllString(s string, n int) []string
+	func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
+	func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
+	func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
+	func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
+	func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
+	func (re *Regexp) FindIndex(b []byte) (loc []int)
+	func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
+	func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
+	func (re *Regexp) FindString(s string) string
+	func (re *Regexp) FindStringIndex(s string) (loc []int)
+	func (re *Regexp) FindStringSubmatch(s string) []string
+	func (re *Regexp) FindStringSubmatchIndex(s string) []int
+	func (re *Regexp) FindSubmatch(b []byte) [][]byte
+	func (re *Regexp) FindSubmatchIndex(b []byte) []int
+
+These 18 methods include identical functions for different input sources (byte slice, string and io.RuneReader), so we can really simplify this list by ignoring input sources as follows:
+
+	func (re *Regexp) Find(b []byte) []byte
+	func (re *Regexp) FindAll(b []byte, n int) [][]byte
+	func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
+	func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
+	func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
+	func (re *Regexp) FindIndex(b []byte) (loc []int)
+	func (re *Regexp) FindSubmatch(b []byte) [][]byte
+	func (re *Regexp) FindSubmatchIndex(b []byte) []int
+
+Code sample:
+
+	package main
+
+	import (
+		"fmt"
+		"regexp"
+	)
+
+	func main() {
+		a := "I am learning Go language"
+
+		re, _ := regexp.Compile("[a-z]{2,4}")
+
+		// Find the first match.
+		one := re.Find([]byte(a))
+		fmt.Println("Find:", string(one))
+
+		// Find all matches and save to a slice, n less than 0 means return all matches, indicates length of slice if it's greater than 0.
+		all := re.FindAll([]byte(a), -1)
+		fmt.Println("FindAll", all)
+
+		// Find index of first match, start and end position.
+		index := re.FindIndex([]byte(a))
+		fmt.Println("FindIndex", index)
+
+		// Find index of all matches, the n does same job as above.
+		allindex := re.FindAllIndex([]byte(a), -1)
+		fmt.Println("FindAllIndex", allindex)
+
+		re2, _ := regexp.Compile("am(.*)lang(.*)")
+
+		// Find first submatch and return array, the first element contains all elements, the second element contains the result of first (), the third element contains the result of second ().
+		// Output: 
+		// the first element: "am learning Go language"
+		// the second element: " learning Go ", notice spaces will be outputed as well.
+		// the third element: "uage"
+		submatch := re2.FindSubmatch([]byte(a))
+		fmt.Println("FindSubmatch", submatch)
+		for _, v := range submatch {
+			fmt.Println(string(v))
+		}
+
+		// Same thing like FindIndex().
+		submatchindex := re2.FindSubmatchIndex([]byte(a))
+		fmt.Println(submatchindex)
+
+		// FindAllSubmatch, find all submatches.
+		submatchall := re2.FindAllSubmatch([]byte(a), -1)
+		fmt.Println(submatchall)
+
+		// FindAllSubmatchIndex,find index of all submatches.
+		submatchallindex := re2.FindAllSubmatchIndex([]byte(a), -1)
+		fmt.Println(submatchallindex)
+	}
+
+As we've previously introduced, Regexp also has 3 methods for matching. They do the exact same things as the exported functions. In fact, those exported functions actually call these methods under the hood:
+
+	func (re *Regexp) Match(b []byte) bool
+	func (re *Regexp) MatchReader(r io.RuneReader) bool
+	func (re *Regexp) MatchString(s string) bool
+
+Next, let's see how to replace strings using Regexp:
+
+	func (re *Regexp) ReplaceAll(src, repl []byte) []byte
+	func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
+	func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
+	func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
+	func (re *Regexp) ReplaceAllString(src, repl string) string
+	func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string
+
+These are used in the crawling example, so we don't explain more here.
+
+Let's take a look at the definition of `Expand`:
+
+	func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
+	func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte
+
+So how do we use `Expand`?
+
+	func main() {
+		src := []byte(`
+			call hello alice
+			hello bob
+			call hello eve
+		`)
+		pat := regexp.MustCompile(`(?m)(call)\s+(?P<cmd>\w+)\s+(?P<arg>.+)\s*$`)
+		res := []byte{}
+		for _, s := range pat.FindAllSubmatchIndex(src, -1) {
+			res = pat.Expand(res, []byte("$cmd('$arg')\n"), src, s)
+		}
+		fmt.Println(string(res))
+	}
+
+At this point, you've learned the whole `regexp` package in Go. I hope that you can understand more by studying examples of key methods, so that you can do something interesting on your own.
+
+## Links
+
+- [Directory](preface.md)
+- Previous section: [JSON](07.2.md)
+- Next section: [Templates](07.4.md)