# 7.3 Regexp

Regular expressions (Regexp) are a complicated but powerful tool for pattern matching and text manipulation. They are slower than plain text matching, but far more flexible: with the right pattern you can filter almost any kind of text out of your source content. If you need to collect data in web development, Regexp makes it straightforward to extract the meaningful parts.

Go supports regular expressions officially through the `regexp` package. If you've already used regexp in other programming languages, it should feel familiar. Note that Go implements the RE2 syntax, except for `\C`. For more details, see [http://code.google.com/p/re2/wiki/Syntax](http://code.google.com/p/re2/wiki/Syntax).

The `strings` package already handles many jobs such as searching (`Contains`, `Index`), replacing (`Replace`), and parsing (`Split`, `Join`), and it is faster than Regexp, but these are simple operations. If you want to search a string without case sensitivity, Regexp is a good choice. So if the `strings` package can achieve your goal, just use it, because it is easy to use and read; if you need more advanced operations, use Regexp.

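
As a minimal sketch of the case-insensitive search mentioned above: the `(?i)` flag switches a pattern to case-insensitive matching, and `regexp.QuoteMeta` escapes the search word so it is taken literally. The helper name `containsFold` and the sample text are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// containsFold (hypothetical helper) reports whether s contains word,
// ignoring case. QuoteMeta escapes regexp metacharacters in word.
func containsFold(s, word string) (bool, error) {
	return regexp.MatchString("(?i)"+regexp.QuoteMeta(word), s)
}

func main() {
	text := "Go has a package named REGEXP."

	// A plain substring search is case sensitive.
	fmt.Println(strings.Contains(text, "regexp")) // false

	ok, err := containsFold(text, "regexp")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(ok) // true
}
```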
If you remember the form verification we talked about earlier, we already used Regexp there to check whether input was valid. Be aware that all characters are treated as UTF-8. Now let's learn more about Go's `regexp` package!

## Match

The `regexp` package has 3 functions for matching. Each returns true if the pattern matches and false otherwise.

    func Match(pattern string, b []byte) (matched bool, err error)
    func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
    func MatchString(pattern string, s string) (matched bool, err error)

All 3 functions check whether `pattern` matches the input source and return true if it does. If your Regexp has a syntax error, they return an error. The 3 input sources of these functions are a byte slice, a `RuneReader`, and a string.

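
The following sketch shows both results of a match function: the boolean for valid patterns, and the error for a pattern with a syntax error. The helper name `isDigits` and the sample inputs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// isDigits (hypothetical helper) reports whether s consists only of ASCII digits.
func isDigits(s string) (bool, error) {
	return regexp.MatchString(`^[0-9]+$`, s)
}

func main() {
	ok, err := isDigits("2024")
	fmt.Println(ok, err) // true <nil>

	// An unclosed character class is a syntax error:
	// matched is false and err is non-nil.
	ok, err = regexp.MatchString(`^[0-9+$`, "2024")
	fmt.Println(ok, err != nil) // false true

	// The []byte variant behaves the same way.
	ok, _ = regexp.Match(`^[0-9]+$`, []byte("12a"))
	fmt.Println(ok) // false
}
```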
Here is an example that verifies an IP address:

    // IsIP reports whether ip looks like a dotted IPv4 address.
    // It only checks the shape (1 to 3 digits per part), not the 0-255 range.
    func IsIP(ip string) (b bool) {
        if m, _ := regexp.MatchString("^[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}$", ip); !m {
            return false
        }
        return true
    }

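The pattern above accepts any 1 to 3 digits per part, so a value such as `999.999.999.999` passes. One possible refinement, sketched here under the assumption that only IPv4 dotted notation matters, captures each part and checks its numeric range. The names `ipv4Pattern` and `isIPv4` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// ipv4Pattern is compiled once; each part is captured in its own group.
var ipv4Pattern = regexp.MustCompile(`^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$`)

// isIPv4 (hypothetical helper) reports whether ip is four dot-separated
// decimal numbers, each in the range 0-255.
func isIPv4(ip string) bool {
	m := ipv4Pattern.FindStringSubmatch(ip)
	if m == nil {
		return false
	}
	// m[0] is the whole match; m[1:] holds the four captured parts.
	for _, part := range m[1:] {
		n, err := strconv.Atoi(part)
		if err != nil || n > 255 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isIPv4("192.168.0.1"))     // true
	fmt.Println(isIPv4("999.999.999.999")) // false
	fmt.Println(isIPv4("192.168.0"))       // false
}
```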
As you can see, patterns in the `regexp` package are not that different from what you know. Here is one more example, which checks whether user input is a number:

    package main

    import (
        "fmt"
        "os"
        "regexp"
    )

    func main() {
        if len(os.Args) == 1 {
            fmt.Println("Usage: regexp [string]")
            os.Exit(1)
        } else if m, _ := regexp.MatchString("^[0-9]+$", os.Args[1]); m {
            fmt.Println("Number")
        } else {
            fmt.Println("Not number")
        }
    }

In the examples above, we use `Match(Reader|String)` to check whether content is valid. They are all easy to use.

## Filter

Match mode can verify content, but it cannot cut, filter, or collect data from it. If you want to do that, you have to use the complex mode of Regexp.

Sometimes we need to write a crawler. Here is an example that shows how to use Regexp to filter and cut data:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "regexp"
        "strings"
    )

    func main() {
        resp, err := http.Get("http://www.baidu.com")
        if err != nil {
            fmt.Println("http get error.")
            return // resp is nil here, so we must not touch resp.Body.
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            fmt.Println("http read error")
            return
        }

        src := string(body)

        // Convert HTML tags to lower case.
        re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
        src = re.ReplaceAllStringFunc(src, strings.ToLower)

        // Remove STYLE.
        re, _ = regexp.Compile("\\<style[\\S\\s]+?\\</style\\>")
        src = re.ReplaceAllString(src, "")

        // Remove SCRIPT.
        re, _ = regexp.Compile("\\<script[\\S\\s]+?\\</script\\>")
        src = re.ReplaceAllString(src, "")

        // Remove all HTML code in angle brackets, and replace it with a newline.
        re, _ = regexp.Compile("\\<[\\S\\s]+?\\>")
        src = re.ReplaceAllString(src, "\n")

        // Collapse runs of two or more whitespace characters into one newline.
        re, _ = regexp.Compile("\\s{2,}")
        src = re.ReplaceAllString(src, "\n")

        fmt.Println(strings.TrimSpace(src))
    }

In this example, we use `Compile` as the first step of complex mode. It checks that your Regexp syntax is correct, then returns a `*Regexp` that the other operations use to parse content.

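
A common way to use this step, sketched below, is to compile a fixed pattern once and reuse the resulting `*Regexp`, while checking the error from `Compile` for patterns that are not known in advance. The names `tagPattern` and `stripTags` and the sample HTML are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern is compiled once at package initialization. MustCompile panics
// on a bad pattern, which is acceptable for a constant written in the source.
var tagPattern = regexp.MustCompile(`<[^>]+>`)

// stripTags (hypothetical helper) removes everything that looks like an HTML tag.
func stripTags(html string) string {
	return tagPattern.ReplaceAllString(html, "")
}

func main() {
	// For patterns that come from outside the program, use Compile
	// and check the error instead of ignoring it.
	if _, err := regexp.Compile(`<[^>+`); err != nil {
		fmt.Println("invalid pattern:", err)
	}

	fmt.Println(stripTags("<p>Hello, <b>Go</b>!</p>")) // Hello, Go!
}
```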
Here are the functions that parse your Regexp syntax:

    func Compile(expr string) (*Regexp, error)
    func CompilePOSIX(expr string) (*Regexp, error)
    func MustCompile(str string) *Regexp
    func MustCompilePOSIX(str string) *Regexp

The difference between `CompilePOSIX` and `Compile` is that the former restricts the pattern to POSIX ERE syntax and uses leftmost-longest matching, while the latter uses leftmost-first matching, as Perl does. Both return the match that starts earliest in the input; they differ only when several matches start at that position. For instance, for the Regexp `a|ab` and the content `"ab"`, `CompilePOSIX` returns `ab` but `Compile` returns `a`. For the Regexp `[a-z]{2,4}` and the content `"aa09aaa88aaaa"`, both return `aa`, because the leftmost match starts at position 0 and only two letters are available there. The `Must` prefix means the function panics when the Regexp syntax is not correct, instead of returning an error.

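
A small sketch of the two matching rules, leftmost-first for `Compile` and leftmost-longest for `CompilePOSIX`. The helper name `firstMatch` is an assumption for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch (hypothetical helper) returns the first match of expr in s
// under leftmost-first semantics (Compile) and under leftmost-longest
// semantics (CompilePOSIX).
func firstMatch(expr, s string) (perl, posix string) {
	perl = regexp.MustCompile(expr).FindString(s)
	posix = regexp.MustCompilePOSIX(expr).FindString(s)
	return perl, posix
}

func main() {
	// With alternation, the two rules pick different matches.
	perl, posix := firstMatch(`a|ab`, "ab")
	fmt.Printf("Compile: %q, CompilePOSIX: %q\n", perl, posix)
	// Compile: "a", CompilePOSIX: "ab"

	// Both rules start at the leftmost position, so here they agree.
	perl, posix = firstMatch(`[a-z]{2,4}`, "aa09aaa88aaaa")
	fmt.Printf("Compile: %q, CompilePOSIX: %q\n", perl, posix)
	// Compile: "aa", CompilePOSIX: "aa"
}
```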
Now that you know how to create a new Regexp, let's see what methods this struct provides to help us operate on content:

    func (re *Regexp) Find(b []byte) []byte
    func (re *Regexp) FindAll(b []byte, n int) [][]byte
    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
    func (re *Regexp) FindAllString(s string, n int) []string
    func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
    func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
    func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
    func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
    func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
    func (re *Regexp) FindIndex(b []byte) (loc []int)
    func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
    func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
    func (re *Regexp) FindString(s string) string
    func (re *Regexp) FindStringIndex(s string) (loc []int)
    func (re *Regexp) FindStringSubmatch(s string) []string
    func (re *Regexp) FindStringSubmatchIndex(s string) []int
    func (re *Regexp) FindSubmatch(b []byte) [][]byte
    func (re *Regexp) FindSubmatchIndex(b []byte) []int

These 18 methods provide the same functionality for different input sources (byte slice, string, and `io.RuneReader`). Ignoring the input source, we can reduce them to the following:

    func (re *Regexp) Find(b []byte) []byte
    func (re *Regexp) FindAll(b []byte, n int) [][]byte
    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
    func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
    func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
    func (re *Regexp) FindIndex(b []byte) (loc []int)
    func (re *Regexp) FindSubmatch(b []byte) [][]byte
    func (re *Regexp) FindSubmatchIndex(b []byte) []int

Code sample:

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        a := "I am learning Go language"

        re, _ := regexp.Compile("[a-z]{2,4}")

        // Find the first match.
        one := re.Find([]byte(a))
        fmt.Println("Find:", string(one))

        // Find all matches and save them to a slice. If n is less than 0,
        // all matches are returned; otherwise at most n matches are returned.
        all := re.FindAll([]byte(a), -1)
        fmt.Printf("FindAll: %q\n", all)

        // Find the index of the first match: its start and end position.
        index := re.FindIndex([]byte(a))
        fmt.Println("FindIndex", index)

        // Find the indexes of all matches; n does the same job as above.
        allindex := re.FindAllIndex([]byte(a), -1)
        fmt.Println("FindAllIndex", allindex)

        re2, _ := regexp.Compile("am(.*)lang(.*)")

        // Find the first match and its submatches. The first element is the
        // whole match, the second is the result of the first (), and the
        // third is the result of the second ().
        // Output:
        // the first element: "am learning Go language"
        // the second element: " learning Go ", notice that spaces are included.
        // the third element: "uage"
        submatch := re2.FindSubmatch([]byte(a))
        fmt.Printf("FindSubmatch: %q\n", submatch)
        for _, v := range submatch {
            fmt.Println(string(v))
        }

        // Like FindIndex(), but also returns the positions of the submatches.
        submatchindex := re2.FindSubmatchIndex([]byte(a))
        fmt.Println(submatchindex)

        // FindAllSubmatch finds all matches together with their submatches.
        submatchall := re2.FindAllSubmatch([]byte(a), -1)
        fmt.Printf("%q\n", submatchall)

        // FindAllSubmatchIndex finds the indexes of all matches and submatches.
        submatchallindex := re2.FindAllSubmatchIndex([]byte(a), -1)
        fmt.Println(submatchallindex)
    }

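The string variants work the same way without the `[]byte` conversions. As a minimal sketch, the following collects `key=value` pairs with `FindAllStringSubmatch`; the names `pairPattern` and `parsePairs` and the sample input are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// pairPattern matches key=value pairs made of word characters.
var pairPattern = regexp.MustCompile(`(\w+)=(\w+)`)

// parsePairs (hypothetical helper) collects every key=value pair in s.
func parsePairs(s string) map[string]string {
	pairs := make(map[string]string)
	for _, m := range pairPattern.FindAllStringSubmatch(s, -1) {
		// m[0] is the whole match; m[1] and m[2] are the two groups.
		pairs[m[1]] = m[2]
	}
	return pairs
}

func main() {
	s := "host=localhost port=8080 debug=true"

	fmt.Println(pairPattern.FindString(s))       // host=localhost
	fmt.Println(pairPattern.FindAllString(s, 2)) // [host=localhost port=8080]
	fmt.Println(pairPattern.FindStringIndex(s))  // [0 14]
	fmt.Println(parsePairs(s)["port"])           // 8080
}
```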
As introduced earlier, `Regexp` also has 3 methods for matching. They do exactly the same thing as the exported functions; in fact, the exported functions call these methods under the hood:

    func (re *Regexp) Match(b []byte) bool
    func (re *Regexp) MatchReader(r io.RuneReader) bool
    func (re *Regexp) MatchString(s string) bool

Next, let's see how to do replacement with Regexp:

    func (re *Regexp) ReplaceAll(src, repl []byte) []byte
    func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
    func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
    func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
    func (re *Regexp) ReplaceAllString(src, repl string) string
    func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

These are used in the crawler example, so we won't explain them further here.

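
One detail worth a short sketch is how the replacement text is interpreted: `ReplaceAllString` expands `$1`, `$2`, and so on to the captured groups, while the `Literal` variants insert the replacement text unchanged. The names `datePattern` and `reorderDates` and the sample text are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// datePattern captures year, month, and day of a YYYY-MM-DD date.
var datePattern = regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)

// reorderDates (hypothetical helper) rewrites YYYY-MM-DD dates as DD/MM/YYYY.
func reorderDates(s string) string {
	return datePattern.ReplaceAllString(s, "$3/$2/$1")
}

func main() {
	s := "released 2012-03-28, updated 2013-05-13"

	// $1, $2 and $3 are expanded to the captured groups.
	fmt.Println(reorderDates(s))
	// released 28/03/2012, updated 13/05/2013

	// The Literal variant does not expand anything.
	fmt.Println(datePattern.ReplaceAllLiteralString(s, "$3/$2/$1"))
	// released $3/$2/$1, updated $3/$2/$1

	// The Func variant computes each replacement from the matched text.
	dotted := datePattern.ReplaceAllStringFunc(s, func(m string) string {
		return strings.ReplaceAll(m, "-", ".")
	})
	fmt.Println(dotted)
	// released 2012.03.28, updated 2013.05.13
}
```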
Let's take a look at `Expand`:

    func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
    func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte

So how do you use `Expand`?

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        src := []byte(`
            call hello alice
            hello bob
            call hello eve
        `)
        // (?m) lets $ match at the end of each line; cmd and arg are named groups.
        pat := regexp.MustCompile(`(?m)(call)\s+(?P<cmd>\w+)\s+(?P<arg>.+)\s*$`)
        res := []byte{}
        for _, s := range pat.FindAllSubmatchIndex(src, -1) {
            // Append the template to res, filling in $cmd and $arg from this match.
            res = pat.Expand(res, []byte("$cmd('$arg')\n"), src, s)
        }
        fmt.Println(string(res))
    }

At this point, you've learned the whole `regexp` package in Go. I hope that studying the examples of its key methods helps you understand it better, and that you go on to do something interesting by yourself.

## Links

- [Directory](preface.md)
- Previous section: [JSON](07.2.md)
- Next section: [Templates](07.4.md)