
# 7.3 Regexp

Regular expressions (regexp) are a complicated but powerful tool for pattern matching and text manipulation. They are slower than plain text matching, but far more flexible: with the right pattern you can filter almost any kind of text out of your source content. If you need to collect data in web development, regexp makes it easy to extract the meaningful parts.

Go supports regular expressions officially through the `regexp` package. If you have used regexp in other programming languages, it should feel familiar. Note that Go implements the RE2 syntax except for `\C`. For more details, see [http://code.google.com/p/re2/wiki/Syntax](http://code.google.com/p/re2/wiki/Syntax).

The `strings` package already handles many jobs such as searching (Contains, Index), replacing (Replace), and splitting and joining (Split, Join), and it is faster than regexp, but these are all simple operations. If you need something like a case-insensitive search, regexp is usually the better choice. So if the `strings` package can achieve your goal, use it, because it is easy to use and read; if you need more advanced operations, use regexp.
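
As a rough illustration of that trade-off, the sketch below compares `strings.Contains` with a case-insensitive regexp search. The helper name `containsFold` and the sample text are assumptions made up for this example; `(?i)` is the RE2 flag for case-insensitive matching, and `regexp.QuoteMeta` escapes any metacharacters in the search word.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// containsFold (hypothetical helper) reports whether s contains word,
// ignoring case. QuoteMeta makes sure word is matched literally.
func containsFold(s, word string) (bool, error) {
	return regexp.MatchString("(?i)"+regexp.QuoteMeta(word), s)
}

func main() {
	s := "Learning GO is fun"

	// strings.Contains is case-sensitive, so "go" is not found in "GO".
	fmt.Println(strings.Contains(s, "go")) // false

	ok, err := containsFold(s, "go")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(ok) // true
}
```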

If you remember the form verification we discussed earlier, we already used regexp there to check whether input was valid. Be aware that all characters are treated as UTF-8. Now let's learn more about the Go `regexp` package!

## Match

The `regexp` package has three functions for matching. Each returns true if the pattern matches and false otherwise.

	func Match(pattern string, b []byte) (matched bool, err error)
	func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
	func MatchString(pattern string, s string) (matched bool, err error)

All three functions check whether `pattern` matches the input source and return true if it does. If the pattern has a syntax error, they return an error instead. The three input sources are a byte slice, a `RuneReader` and a string.
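
The three input sources can be tried side by side. The following is a minimal sketch, assuming a made-up helper `matchAll` that feeds the same pattern and input to each function; it also shows that a malformed pattern is reported through the returned error rather than as a plain `false`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchAll (hypothetical helper) runs one pattern through Match,
// MatchReader and MatchString, in that order.
func matchAll(pattern, input string) ([3]bool, error) {
	var out [3]bool
	var err error
	if out[0], err = regexp.Match(pattern, []byte(input)); err != nil {
		return out, err
	}
	// *strings.Reader implements io.RuneReader.
	if out[1], err = regexp.MatchReader(pattern, strings.NewReader(input)); err != nil {
		return out, err
	}
	out[2], err = regexp.MatchString(pattern, input)
	return out, err
}

func main() {
	res, err := matchAll(`^[a-z]+$`, "gopher")
	fmt.Println(res, err) // [true true true] <nil>

	// An unclosed bracket is a syntax error, so err is non-nil here.
	_, err = matchAll(`[a-z`, "gopher")
	fmt.Println(err != nil) // true
}
```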

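Here is an example that verifies an IP address. Note that the pattern only checks the shape (four dot-separated groups of one to three digits), so a value such as `999.1.1.1` also passes: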

	func IsIP(ip string) (b bool) {
		if m, _ := regexp.MatchString("^[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}$", ip); !m {
			return false
		}
		return true
	}

As you can see, patterns in the `regexp` package look much the same as in other languages. Here is one more example, which verifies whether user input is a number:

	func main() {
		if len(os.Args) == 1 {
			fmt.Println("Usage: regexp [string]")
			os.Exit(1)
		} else if m, _ := regexp.MatchString("^[0-9]+$", os.Args[1]); m {
			fmt.Println("Number")
		} else {
			fmt.Println("Not number")
		}
	}

In the examples above, we used the `Match(Reader|String)` family to check whether content is valid. They are all easy to use.
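
The `IsIP` pattern above accepts any group of one to three digits. If each group should also stay within 0 to 255, one possible approach is to encode that range in the pattern itself. This is only a sketch: the names `octet`, `ipv4` and `isStrictIPv4` are invented for this example, and `regexp.MustCompile` (covered later in this section) is used so the pattern is compiled once instead of on every call.

```go
package main

import (
	"fmt"
	"regexp"
)

// octet matches a decimal number from 0 to 255 without leading zeros.
const octet = `(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])`

// ipv4 is compiled once at package initialisation and reused by every call.
var ipv4 = regexp.MustCompile(`^` + octet + `(\.` + octet + `){3}$`)

// isStrictIPv4 (hypothetical helper) reports whether ip is a dotted quad
// whose four parts are each in the range 0 to 255.
func isStrictIPv4(ip string) bool {
	return ipv4.MatchString(ip)
}

func main() {
	fmt.Println(isStrictIPv4("192.168.0.1")) // true
	fmt.Println(isStrictIPv4("999.1.1.1"))   // false
}
```

For real applications, `net.ParseIP` from the standard library may be a simpler choice than a hand-written pattern.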

## Filter

Match mode can verify content, but it cannot cut, filter or collect data from it. To do that, you have to use the complex mode of regexp.

Sometimes we need to write a crawler. Here is an example that shows how to use regexp to filter and cut data.

	package main

	import (
		"fmt"
		"io/ioutil"
		"net/http"
		"regexp"
		"strings"
	)

	func main() {
		resp, err := http.Get("http://www.baidu.com")
		if err != nil {
			fmt.Println("http get error.")
			return // resp is nil on error, so stop before touching resp.Body.
		}
		defer resp.Body.Close()
		body, err := ioutil.ReadAll(resp.Body)
		if err != nil {
			fmt.Println("http read error")
			return
		}

		src := string(body)

		// Convert HTML tags to lower case.
		re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
		src = re.ReplaceAllStringFunc(src, strings.ToLower)

		// Remove STYLE.
		re, _ = regexp.Compile("\\<style[\\S\\s]+?\\</style\\>")
		src = re.ReplaceAllString(src, "")

		// Remove SCRIPT.
		re, _ = regexp.Compile("\\<script[\\S\\s]+?\\</script\\>")
		src = re.ReplaceAllString(src, "")

		// Remove all HTML code in angle brackets, and replace with newline.
		re, _ = regexp.Compile("\\<[\\S\\s]+?\\>")
		src = re.ReplaceAllString(src, "\n")

		// Collapse runs of whitespace into a single newline.
		re, _ = regexp.Compile("\\s{2,}")
		src = re.ReplaceAllString(src, "\n")

		fmt.Println(strings.TrimSpace(src))
	}

In this example, we use `Compile` as the first step of complex mode. It checks that your regexp syntax is correct and returns a `Regexp` that the other operations use to parse content.

Here are the functions that compile a regexp:

	func Compile(expr string) (*Regexp, error)
	func CompilePOSIX(expr string) (*Regexp, error)
	func MustCompile(str string) *Regexp
	func MustCompilePOSIX(str string) *Regexp

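The difference between `CompilePOSIX` and `Compile` is that the former is restricted to POSIX syntax and uses leftmost-longest matching, while the latter uses leftmost-first matching, as Perl and Python do. Both return the match that starts earliest in the input; they differ only in which of several matches starting at that position is chosen. For instance, for the regexp `a|ab` and content `"ab"`, `Compile` returns `a` because the first alternative wins, but `CompilePOSIX` returns `ab` because it is longer. For the regexp `[a-z]{2,4}` and content `"aa09aaa88aaaa"`, both return `aa`, since that is the only match at the leftmost position. The `Must` prefix means the function panics when the regexp syntax is incorrect, whereas the versions without it return an error.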
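
The two matching semantics can be compared directly. The sketch below assumes a made-up helper `firstMatches` that compiles the same expression both ways and returns the first match under each; it also shows `Compile` reporting a syntax error as a value instead of panicking.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatches (hypothetical helper) returns the first match of expr in s
// under leftmost-first (Compile) and leftmost-longest (CompilePOSIX) rules.
func firstMatches(expr, s string) (string, string, error) {
	re, err := regexp.Compile(expr)
	if err != nil {
		return "", "", err
	}
	rePOSIX, err := regexp.CompilePOSIX(expr)
	if err != nil {
		return "", "", err
	}
	return re.FindString(s), rePOSIX.FindString(s), nil
}

func main() {
	// The alternation exposes the difference between the two semantics.
	first, longest, _ := firstMatches(`a|ab`, "ab")
	fmt.Println(first, longest) // a ab

	// With a greedy repetition both choose the same match.
	first, longest, _ = firstMatches(`[a-z]{2,4}`, "aa09aaa88aaaa")
	fmt.Println(first, longest) // aa aa

	// An unclosed group is a syntax error; Compile returns it as err,
	// whereas MustCompile would panic on the same input.
	_, _, err := firstMatches(`a(b`, "ab")
	fmt.Println(err != nil) // true
}
```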

Now that you know how to create a new `Regexp`, let's see which methods this struct provides for working with content:

	func (re *Regexp) Find(b []byte) []byte
	func (re *Regexp) FindAll(b []byte, n int) [][]byte
	func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
	func (re *Regexp) FindAllString(s string, n int) []string
	func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
	func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
	func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
	func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
	func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
	func (re *Regexp) FindIndex(b []byte) (loc []int)
	func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
	func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
	func (re *Regexp) FindString(s string) string
	func (re *Regexp) FindStringIndex(s string) (loc []int)
	func (re *Regexp) FindStringSubmatch(s string) []string
	func (re *Regexp) FindStringSubmatchIndex(s string) []int
	func (re *Regexp) FindSubmatch(b []byte) [][]byte
	func (re *Regexp) FindSubmatchIndex(b []byte) []int

These 18 methods include the same functionality for different input sources (byte slice, string and `io.RuneReader`), so we can simplify the list by ignoring the input source as follows:

	func (re *Regexp) Find(b []byte) []byte
	func (re *Regexp) FindAll(b []byte, n int) [][]byte
	func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
	func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
	func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
	func (re *Regexp) FindIndex(b []byte) (loc []int)
	func (re *Regexp) FindSubmatch(b []byte) [][]byte
	func (re *Regexp) FindSubmatchIndex(b []byte) []int

Code sample:

	package main

	import (
		"fmt"
		"regexp"
	)

	func main() {
		a := "I am learning Go language"

		re, _ := regexp.Compile("[a-z]{2,4}")

		// Find the first match.
		one := re.Find([]byte(a))
		fmt.Println("Find:", string(one))

		// Find all matches and save them to a slice. A negative n returns all matches; a non-negative n returns at most n matches.
		all := re.FindAll([]byte(a), -1)
		fmt.Printf("FindAll %q\n", all)

		// Find index of first match, start and end position.
		index := re.FindIndex([]byte(a))
		fmt.Println("FindIndex", index)

		// Find index of all matches, the n does same job as above.
		allindex := re.FindAllIndex([]byte(a), -1)
		fmt.Println("FindAllIndex", allindex)

		re2, _ := regexp.Compile("am(.*)lang(.*)")

		// Find the first match and its submatches. The first element is the whole match,
		// the second is the text captured by the first (), the third by the second ().
		// Output:
		// the first element: "am learning Go language"
		// the second element: " learning Go ", notice that the spaces are included.
		// the third element: "uage"
		submatch := re2.FindSubmatch([]byte(a))
		fmt.Println("FindSubmatch", submatch)
		for _, v := range submatch {
			fmt.Println(string(v))
		}

		// Same as FindSubmatch, but returns index pairs: the whole match first, then each ().
		submatchindex := re2.FindSubmatchIndex([]byte(a))
		fmt.Println(submatchindex)

		// FindAllSubmatch finds the submatches of every match.
		submatchall := re2.FindAllSubmatch([]byte(a), -1)
		fmt.Println(submatchall)

		// FindAllSubmatchIndex finds the indexes of the submatches of every match.
		submatchallindex := re2.FindAllSubmatchIndex([]byte(a), -1)
		fmt.Println(submatchallindex)
	}
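
When the content is already a string, the `String` variants avoid the `[]byte` conversions used above. The following sketch reuses the sentence and pattern from the sample; the helper name `firstN` is an assumption for illustration, and it shows how the `n` argument limits the number of results.

```go
package main

import (
	"fmt"
	"regexp"
)

// word matches runs of two to four lower-case ASCII letters.
var word = regexp.MustCompile(`[a-z]{2,4}`)

// firstN (hypothetical helper) returns at most n matches of word in s;
// a negative n returns all of them.
func firstN(s string, n int) []string {
	return word.FindAllString(s, n)
}

func main() {
	a := "I am learning Go language"

	fmt.Printf("%q\n", firstN(a, -1)) // ["am" "lear" "ning" "lang" "uage"]
	fmt.Printf("%q\n", firstN(a, 2))  // ["am" "lear"]

	// The index pair is the half-open byte range [start, end) of the first match.
	fmt.Println(word.FindStringIndex(a)) // [2 4]
}
```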

As mentioned before, `Regexp` also has three methods for matching. They do exactly the same thing as the exported functions; in fact, the exported functions compile the pattern and then call these methods:

	func (re *Regexp) Match(b []byte) bool
	func (re *Regexp) MatchReader(r io.RuneReader) bool
	func (re *Regexp) MatchString(s string) bool
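
Because the exported functions compile the pattern on every call, it can be worth compiling once and calling the method when the same pattern is checked many times. Here is a minimal sketch under that assumption; `digits` and `countNumbers` are names invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// digits is compiled once and shared by every call below.
var digits = regexp.MustCompile(`^[0-9]+$`)

// countNumbers (hypothetical helper) counts the inputs that consist
// only of ASCII digits.
func countNumbers(inputs []string) int {
	n := 0
	for _, s := range inputs {
		if digits.MatchString(s) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countNumbers([]string{"42", "4x2", "007", ""})) // 2
}
```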

Next, let's see how to replace text with regexp:

	func (re *Regexp) ReplaceAll(src, repl []byte) []byte
	func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
	func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
	func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
	func (re *Regexp) ReplaceAllString(src, repl string) string
	func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

These were used in the crawler example, so we won't explain them further here.

Let's take a look at `Expand`:
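
For a quick side-by-side view of how the replacement variants differ, here is a small sketch. The pattern `kv` and the helpers `swap`, `literal` and `upper` are assumptions for illustration: `ReplaceAllString` expands `${1}` and `${2}` to the captured groups, `ReplaceAllLiteralString` inserts the replacement text unchanged, and `ReplaceAllStringFunc` passes each match through a function.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// kv matches simple key=value pairs and captures both sides.
var kv = regexp.MustCompile(`(\w+)=(\w+)`)

// swap rewrites every key=value pair as value=key.
func swap(s string) string {
	return kv.ReplaceAllString(s, "${2}=${1}")
}

// literal replaces every pair with the replacement text as-is, with no expansion.
func literal(s string) string {
	return kv.ReplaceAllLiteralString(s, "${2}=${1}")
}

// upper upper-cases every matched pair.
func upper(s string) string {
	return kv.ReplaceAllStringFunc(s, strings.ToUpper)
}

func main() {
	s := "a=1 b=2"
	fmt.Println(swap(s))    // 1=a 2=b
	fmt.Println(literal(s)) // ${2}=${1} ${2}=${1}
	fmt.Println(upper(s))   // A=1 B=2
}
```

Writing `${1}` rather than `$1` avoids surprises when the reference is followed by a letter, digit or underscore, because `$1x` is read as a reference to a group named `1x`.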

	func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
	func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte

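So how do you use `Expand`? It appends `template` to `dst`, replacing variables such as `$1` or `$name` with the corresponding submatches taken from `src`, where `match` is an index slice as returned by `FindSubmatchIndex`: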

	func main() {
		src := []byte(`
			call hello alice
			hello bob
			call hello eve
		`)
		pat := regexp.MustCompile(`(?m)(call)\s+(?P<cmd>\w+)\s+(?P<arg>.+)\s*$`)
		res := []byte{}
		for _, s := range pat.FindAllSubmatchIndex(src, -1) {
			res = pat.Expand(res, []byte("$cmd('$arg')\n"), src, s)
		}
		fmt.Println(string(res))
	}
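
`ExpandString` works the same way for string input. The sketch below is a hedged illustration rather than a canonical recipe: the pattern `pair` and the helper `toAssignment` are made up for this example. It takes the index slice from `FindStringSubmatchIndex` and expands a template that refers to the named groups.

```go
package main

import (
	"fmt"
	"regexp"
)

// pair matches "key: value" and names both captured groups.
var pair = regexp.MustCompile(`(?P<key>\w+):\s*(?P<value>\w+)`)

// toAssignment (hypothetical helper) rewrites the first "key: value" pair
// in s as "key=value", or returns "" when s contains no such pair.
func toAssignment(s string) string {
	m := pair.FindStringSubmatchIndex(s)
	if m == nil {
		return ""
	}
	// A nil dst is fine: ExpandString appends to it and returns the result.
	return string(pair.ExpandString(nil, "$key=$value", s, m))
}

func main() {
	fmt.Println(toAssignment("name: gopher")) // name=gopher
	fmt.Println(toAssignment("no pair here") == "") // true
}
```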

At this point, you have learned the main parts of the `regexp` package in Go. I hope you understand it better after studying the examples of the key methods, and that you go on to build something interesting yourself.

## Links

- [Directory](preface.md)
- Previous section: [JSON](07.2.md)
- Next section: [Templates](07.4.md)