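Fuzzy matching generally applies to text-type fields, that is, fields that are analyzed for full-text indexing.
I. Data preparation
First, build some sample data to run the matching experiments against.
1. Define the structure
Define a basic user index: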
PUT user
{
"mappings": {
"_doc":{
"properties":{
"user_id":{
"type":"long"
},
"nickname":{
"type":"text"
},
"account":{
"type":"keyword"
},
"password":{
"type":"text",
"index":false
},
"email":{
"type":"text"
},
"avatar":{
"type":"text",
"index":false
},
"status":{
"type":"integer"
},
"tags":{
"type":"text"
},
"address":{
"type":"text"
},
"create_time":{
"type":"date"
}
}
}
}
}
2. Insert the data:
POST _bulk
{"create":{"_index":"user","_type":"_doc","_id":1}}
{"user_id":1,"nickname":"shixinke","account":"shixinke","password":"abc","email":"i@withec.net","avatar":"http://avatar.shixinke.com/images/20190410457845781.png","status":1,"tags":["技术宅","文艺"], "address":"HangZhou,ZheJiang,China", "create_time":1554886662618}
{"create":{"_index":"user","_type":"_doc","_id":2}}
{"user_id":2,"nickname":"withec","account":"withec","password":"abceee","email":"withec@withec.com","avatar":"http://avatar.shixinke.com/images/20190410457145781.png","status":0,"tags":["活泼","运动型"],"address":"ShiYan,HuBei,China","create_time":1554886682618}
{"create":{"_index":"user","_type":"_doc","_id":3}}
{"user_id":3,"nickname":"lucy","account":"lucy","password":"abceee","email":"lucy@google.com","avatar":"http://avatar.google.com/images/20190410457145781.png","status":0,"tags":["安静","文艺"],"address":"JiNan,ShanDong,China","create_time":1554886782618}
{"create":{"_index":"user","_type":"_doc","_id":4}}
{"user_id":4,"nickname":"lilei","account":"lilei","password":"aebceeee","email":"lilei@live.com","avatar":"http://avatar.live.com/images/20190410457145781.png","status":1,"tags":["旅行","读书"],"address":"PuDong,ShangHai,China","create_time":1554889782618}
{"create":{"_index":"user","_type":"_doc","_id":5}}
{"user_id":5,"nickname":"jet","account":"jet","password":"aebceeee","email":"jet@sina.com","avatar":"http://avatar.sina.com/images/20190410457145781.png","status":1,"tags":["固执","文艺"],"address":"NingBo,ZheJiang,China","create_time":1554882782618}
{"create":{"_index":"user","_type":"_doc","_id":6}}
{"user_id":6,"nickname":"shixin","account":"shixin","password":"aebceeee","email":"shixinke@withec.net","avatar":"http://avatar.withec.com/images/20190410457145781.png","status":1,"tags":["天真","文艺"],"address":"SuZhou,JiangSu,China","create_time":1554882752618}
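The bulk body above is newline-delimited JSON: one action line followed by one source line per document, and the body must end with a newline. If you would rather generate it than type it by hand, here is a minimal sketch; the helper name build_bulk_body and its id_field parameter are made up for this illustration and are not part of any Elasticsearch client.

```python
import json

def build_bulk_body(index, docs, id_field="user_id"):
    """Build a newline-delimited bulk body with one 'create' action per document."""
    lines = []
    for doc in docs:
        # Action line: which index, type and ID the next source line is written to.
        action = {"create": {"_index": index, "_type": "_doc", "_id": doc[id_field]}}
        lines.append(json.dumps(action, ensure_ascii=False))
        # Source line: the document itself.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API expects the body to be terminated by a newline.
    return "\n".join(lines) + "\n"

users = [
    {"user_id": 1, "nickname": "shixinke", "tags": ["技术宅", "文艺"]},
    {"user_id": 2, "nickname": "withec", "tags": ["活泼", "运动型"]},
]
print(build_bulk_body("user", users), end="")
```

The resulting string can be sent as the request body of POST _bulk.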
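II. Using match
1. match
- Function: analyzes the search string into terms, then looks them up in the target field's inverted index (the set of terms produced by analyzing the field, each mapped to the IDs of the documents that contain it) to check whether any of the terms is present.
- Applies to:
  - multi-valued text
  - text that can be tokenized
(1) Analyzed fields
- The address field is of type text, so it is analyzed by default. We can use the _analyze API to see how an address is tokenized: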
POST _analyze
{
"text":"HangZhou,ZheJiang,China"
}
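Note: when no analyzer is specified, the _analyze API uses the standard analyzer; for Chinese word segmentation you need to specify a Chinese analyzer. The text above is split at the commas into three lowercased terms: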
{
"tokens" : [
{
"token" : "hangzhou",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "zhejiang",
"start_offset" : 9,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "china",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
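A match query therefore analyzes the query text in the same way and compares the resulting terms with the terms indexed for the target field: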
GET user/_doc/_search
{
"_source":["user_id","address"],
"query":{
"match":{
"address":"zhejiang"
}
}
}
This yields the following result:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "5",
"_score" : 0.2876821,
"_source" : {
"address" : "NingBo,ZheJiang,China",
"user_id" : 5
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"address" : "HangZhou,ZheJiang,China",
"user_id" : 1
}
}
]
}
}
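To make the lookup concrete, here is a small Python model of what happens for the address query. It is only a sketch under stated assumptions: analyze is a rough stand-in for the standard analyzer on ASCII text (split on non-alphanumeric characters, lowercase), and the function names are invented for this example. It does not reproduce scoring.

```python
import re
from collections import defaultdict

# Addresses keyed by document ID, copied from the sample data.
ADDRESSES = {
    1: "HangZhou,ZheJiang,China",
    2: "ShiYan,HuBei,China",
    3: "JiNan,ShanDong,China",
    4: "PuDong,ShangHai,China",
    5: "NingBo,ZheJiang,China",
    6: "SuZhou,JiangSu,China",
}

def analyze(text):
    """Rough approximation of the standard analyzer for ASCII text."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def build_inverted_index(docs):
    """Map each term to the set of document IDs whose field contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def match(index, query):
    """Default (or) semantics: a document matches if it has any query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

index = build_inverted_index(ADDRESSES)
print(analyze("HangZhou,ZheJiang,China"))  # ['hangzhou', 'zhejiang', 'china']
print(sorted(match(index, "zhejiang")))    # [1, 5]
```

The two IDs it prints are the same two documents returned by the query above.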
(2) Multi-valued fields
In this example, tags is a multi-valued field:
GET user/_doc/_search
{
"query":{
"match":{
"tags":"运"
}
}
}
Search result:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.9227539,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9227539,
"_source" : {
"user_id" : 2,
"nickname" : "withec",
"account" : "withec",
"password" : "abceee",
"email" : "withec@withec.com",
"avatar" : "http://avatar.shixinke.com/images/20190410457145781.png",
"status" : 0,
"tags" : [
"活泼",
"运动型"
],
"address" : "ShiYan,HuBei,China",
"create_time" : 1554886682618
}
}
]
}
}
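Note: because we did not specify an analyzer when creating the mapping, the default analyzer splits Chinese text into single characters, which is why the single character "运" matches the tag "运动型".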
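The per-character splitting can be sketched in a few lines of Python. This is an approximation I wrote for illustration, not the real analyzer: it emits one token per CJK ideograph in the basic U+4E00 to U+9FFF block and one token per run of ASCII letters or digits, and it treats every value of a multi-valued field as contributing terms to the same field.

```python
import re

# One token per CJK ideograph, or one token per run of ASCII letters/digits.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")

def analyze(value):
    """Approximate the default analysis of a single- or multi-valued text field."""
    values = value if isinstance(value, list) else [value]
    return [tok.lower() for v in values for tok in TOKEN_RE.findall(v)]

tags = ["活泼", "运动型"]
print(analyze(tags))          # ['活', '泼', '运', '动', '型']
print("运" in analyze(tags))  # True
```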
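(3) Advanced usage of match
The match query also supports the following parameters:
- query : the text to match
- operator : how the analyzed query terms are combined
  - and : every query term must match
  - or : any one query term matching is enough (default)
- minimum_should_match : the minimum number of query terms that must match
A. By default, or when operator is or, a document only needs to contain one of the query terms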
GET user/_doc/_search
{
"_source":["user_id", "tags"],
"query":{
"match":{
"tags":{
"query":"运,艺"
}
}
}
}
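- The query text "运,艺" is analyzed into the term set ["运", "艺"].
- The tags field is analyzed by the default analyzer into a term set as well; for example, ["活泼", "运动型"] becomes ["活", "泼", "运", "动", "型"].
- So it is enough to compare the query's term set with the field's term set: a document matches as long as its tags field contains "运" or "艺".
The query above therefore returns the following result: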
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 1.0126973,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.0126973,
"_source" : {
"user_id" : 6,
"tags" : [
"天真",
"文艺"
]
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9227539,
"_source" : {
"user_id" : 2,
"tags" : [
"活泼",
"运动型"
]
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "5",
"_score" : 0.2876821,
"_source" : {
"user_id" : 5,
"tags" : [
"固执",
"文艺"
]
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"user_id" : 1,
"tags" : [
"技术宅",
"文艺"
]
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.2876821,
"_source" : {
"user_id" : 3,
"tags" : [
"安静",
"文艺"
]
}
}
]
}
}
B. Restricting matches with operator set to and
GET user/_doc/_search
{
"_source":["user_id", "tags"],
"query":{
"match":{
"tags":{
"query":"天,艺",
"operator":"and"
}
}
}
}
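- The query text "天,艺" is analyzed into the term set ["天", "艺"].
- The tags field is analyzed by the default analyzer into a term set.
- Because operator is and, the term set of tags must contain both query terms for the document to match: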
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 2.0253947,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "6",
"_score" : 2.0253947,
"_source" : {
"user_id" : 6,
"tags" : [
"天真",
"文艺"
]
}
}
]
}
}
C. Restricting matches with minimum_should_match
GET user/_doc/_search
{
"_source":["user_id", "tags"],
"query":{
"match":{
"tags":{
"query":"天,艺,运",
"minimum_should_match":2
}
}
}
}
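- The query text "天,艺,运" is analyzed into the term set ["天", "艺", "运"].
- The tags field is analyzed by the default analyzer into a term set.
- So a document matches as long as the term set of its tags field contains at least two of the terms in ["天", "艺", "运"]: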
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 2.0253947,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "6",
"_score" : 2.0253947,
"_source" : {
"user_id" : 6,
"tags" : [
"天真",
"文艺"
]
}
}
]
}
}
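The three variants A, B and C differ only in how many of the query terms a document has to contain. The following Python sketch models that rule on the sample tags. It makes some assumptions: the analyzer is approximated by a per-character regex, minimum_should_match is handled only as a plain integer, and scoring is ignored. The function names are illustrative.

```python
import re

# Tags keyed by document ID, copied from the sample data.
TAGS = {
    1: ["技术宅", "文艺"],
    2: ["活泼", "运动型"],
    3: ["安静", "文艺"],
    4: ["旅行", "读书"],
    5: ["固执", "文艺"],
    6: ["天真", "文艺"],
}

TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")

def analyze(value):
    """Approximate the default analyzer: one term per Chinese character."""
    values = value if isinstance(value, list) else [value]
    return [tok.lower() for v in values for tok in TOKEN_RE.findall(v)]

def match(docs, query, operator="or", minimum_should_match=None):
    """Return the IDs of documents containing enough of the query terms."""
    query_terms = set(analyze(query))
    if minimum_should_match is not None:
        required = minimum_should_match   # explicit threshold (integer only)
    elif operator == "and":
        required = len(query_terms)       # every term must be present
    else:
        required = 1                      # any single term is enough
    hits = []
    for doc_id, value in docs.items():
        matched = query_terms & set(analyze(value))
        if len(matched) >= required:
            hits.append(doc_id)
    return sorted(hits)

print(match(TAGS, "运,艺"))                             # [1, 2, 3, 5, 6]
print(match(TAGS, "天,艺", operator="and"))             # [6]
print(match(TAGS, "天,艺,运", minimum_should_match=2))  # [6]
```

The printed ID sets agree with the hits of the three queries above.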
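2. match_phrase
- Function: the field must contain the search text as a complete phrase, i.e. all query terms, adjacent and in the same order.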
GET user/_doc/_search
{
"_source":["user_id","tags"],
"query":{
"match_phrase":{
"tags":"艺术"
}
}
}
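- The document must contain the phrase "艺术", which is quite different from match. (My understanding: the target field must contain the search phrase in full, rather than merely matching some of its terms after analysis.) The query text is still analyzed, but the resulting terms must occur next to each other and in the same order.
For comparison, here is the corresponding match query: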
GET user/_doc/_search
{
"_source":["user_id","tags"],
"query":{
"match":{
"tags":{
"query":"艺术",
"operator":"and"
}
}
}
}
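- match finds records whose tags contain both "艺" and "术", wherever they occur. In the sample data this is user 1, whose tags ["技术宅", "文艺"] contain both characters, whereas the match_phrase query returns no hits because no document has "艺" immediately followed by "术".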
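The difference comes down to term positions. Here is a simplified Python model, assuming per-character tokens and a gap of 100 positions between the values of a multi-valued field (the default position_increment_gap of text fields, as far as I know). The function names are my own and slop is not modelled.

```python
import re

# Tags keyed by document ID, copied from the sample data.
TAGS = {
    1: ["技术宅", "文艺"],
    2: ["活泼", "运动型"],
    3: ["安静", "文艺"],
    4: ["旅行", "读书"],
    5: ["固执", "文艺"],
    6: ["天真", "文艺"],
}

TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")
POSITION_GAP = 100  # assumed gap between values of a multi-valued field

def positions_of(value):
    """Return {term: set of positions} for a single- or multi-valued field."""
    values = value if isinstance(value, list) else [value]
    positions, pos = {}, 0
    for v in values:
        for tok in TOKEN_RE.findall(v):
            positions.setdefault(tok.lower(), set()).add(pos)
            pos += 1
        pos += POSITION_GAP  # keeps a phrase from matching across two values
    return positions

def match_phrase(docs, phrase):
    """All query terms must occur at consecutive positions, in order."""
    terms = [t.lower() for t in TOKEN_RE.findall(phrase)]
    hits = []
    for doc_id, value in docs.items():
        positions = positions_of(value)
        starts = positions.get(terms[0], set()) if terms else set()
        if any(all(s + i in positions.get(t, set()) for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return sorted(hits)

def match_and(docs, query):
    """match with operator=and: all terms present, positions ignored."""
    terms = {t.lower() for t in TOKEN_RE.findall(query)}
    return sorted(d for d, v in docs.items() if terms and terms <= set(positions_of(v)))

print(match_phrase(TAGS, "艺术"))  # []
print(match_and(TAGS, "艺术"))     # [1]
print(match_phrase(TAGS, "文艺"))  # [1, 3, 5, 6]
```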
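3. multi_match
- Function: searches several fields for the same text (the text is analyzed)
- Parameters:
  - query : the text to match
  - fields : the fields to search
  - type : how the per-field matches are combined and scored
    - best_fields : matches documents in which any field matches, and uses the relevance score of the best-matching field (default)
    - most_fields : matches documents in which any field matches, and combines the scores of all matching fields
    - cross_fields : treats fields that share the same analyzer as one large field; each term only needs to appear in any one of those fields
    - phrase : runs a match_phrase query on each field and uses the score of the best-matching field
    - phrase_prefix : runs a match_phrase_prefix query on each field (the last query term is treated as a prefix) and uses the score of the best-matching field
  - operator : how the analyzed query terms are combined
    - and : every query term must match
    - or : any one query term matching is enough (default)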
GET user/_doc/_search
{
"_source":["user_id","nickname","email"],
"query":{
"multi_match":{
"query":"shixinke",
"fields":["email","nickname"]
}
}
}
- Searches the email and nickname fields of the user index for the keyword "shixinke".
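To illustrate how best_fields and most_fields differ, here is a toy Python model. It rests on loud assumptions: the score is simply the number of query terms found in a field, which is nothing like the real relevance score, and the regex tokenizer only approximates how the standard analyzer splits e-mail addresses. The names toy_score, multi_match and match_type are invented for this sketch.

```python
import re

# nickname and email keyed by document ID, copied from the sample data.
USERS = {
    1: {"nickname": "shixinke", "email": "i@withec.net"},
    2: {"nickname": "withec", "email": "withec@withec.com"},
    3: {"nickname": "lucy", "email": "lucy@google.com"},
    4: {"nickname": "lilei", "email": "lilei@live.com"},
    5: {"nickname": "jet", "email": "jet@sina.com"},
    6: {"nickname": "shixin", "email": "shixinke@withec.net"},
}

def analyze(text):
    """Rough approximation: lowercase runs of ASCII letters and digits."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def toy_score(field_text, query):
    """Toy relevance: how many distinct query terms the field contains."""
    field_terms = set(analyze(field_text))
    return sum(1 for term in set(analyze(query)) if term in field_terms)

def multi_match(docs, query, fields, match_type="best_fields"):
    """Return {doc_id: toy score} for documents where any field matches."""
    # best_fields keeps the best field's score, most_fields adds them up.
    combine = max if match_type == "best_fields" else sum
    results = {}
    for doc_id, doc in docs.items():
        scores = [toy_score(doc[field], query) for field in fields]
        if any(scores):
            results[doc_id] = combine(scores)
    return results

fields = ["email", "nickname"]
print(multi_match(USERS, "shixinke", fields))                          # {1: 1, 6: 1}
print(multi_match(USERS, "withec", fields, match_type="most_fields"))  # {1: 1, 2: 2, 6: 1}
```

In this model user 1 matches "shixinke" through nickname and user 6 through email. For "withec", user 2 matches in both fields, so most_fields gives it a higher toy score than best_fields would.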
III. prefix
Request:
GET user/_doc/_search
{
"_source":["user_id","email"],
"query":{
"prefix":{
"email":"shixin"
}
}
}
Matches records whose email field contains a term starting with shixin (the prefix itself is not analyzed).
Query result:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.0,
"_source" : {
"user_id" : 6,
"email" : "shixinke@withec.net"
}
}
]
}
}
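A prefix query works on the indexed terms rather than on the original field value, and the prefix itself is not analyzed. The sketch below models that in Python, assuming a simplified tokenizer that lowercases runs of ASCII letters and digits; the function name prefix_query is only illustrative.

```python
import re

# Emails keyed by document ID, copied from the sample data.
EMAILS = {
    1: "i@withec.net",
    2: "withec@withec.com",
    3: "lucy@google.com",
    4: "lilei@live.com",
    5: "jet@sina.com",
    6: "shixinke@withec.net",
}

def analyze(text):
    """Rough approximation of the indexed terms of a text field."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def prefix_query(docs, prefix):
    """Match documents having at least one indexed term that starts with prefix.

    The prefix is compared as given: it is not lowercased or tokenized.
    """
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if any(term.startswith(prefix) for term in analyze(text))
    )

print(prefix_query(EMAILS, "shixin"))  # [6]
print(prefix_query(EMAILS, "Shixin"))  # []
```

The second call returns nothing because the indexed terms are lowercase while the prefix is compared unchanged.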
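IV. regexp (regular-expression matching)
1. Function
Searches for content using a regular expression.
2. Usage
- Parameters:
  - value : the regular expression
  - flags : the optional operators to enable
    - ALL : enables all optional operators
    - ANYSTRING : enables @, which matches any entire string
    - AUTOMATON : enables the reserved characters < and > for named automata
    - COMPLEMENT : enables ~, which negates the pattern that follows it, matching strings of any length that the pattern does not match
    - EMPTY : enables #, which denotes the empty language
    - INTERSECTION : enables &, which joins two patterns so that both must match
    - INTERVAL : enables numeric ranges enclosed in <>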
GET user/_doc/_search
{
"_source":["user_id","nickname","email"],
"query":{
"regexp":{
"nickname":{
"value":"l.*"
}
}
}
}
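- l.* means: starts with l, followed by any characters (. matches any character; * means the character matched by . occurs zero, one, or more times).
Query result: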
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"user_id" : 4,
"nickname" : "lilei",
"email" : "lilei@live.com"
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"user_id" : 3,
"nickname" : "lucy",
"email" : "lucy@google.com"
}
}
]
}
}
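Like prefix, regexp is evaluated against the indexed terms, and the pattern has to match a whole term rather than a substring of it. Here is a rough Python model of that behaviour. Note the assumptions: Python's re syntax is not the same as the Lucene regular-expression syntax Elasticsearch uses (the optional operators listed earlier have no direct equivalent here), and the tokenizer is a simplification.

```python
import re

# Nicknames keyed by document ID, copied from the sample data.
NICKNAMES = {
    1: "shixinke",
    2: "withec",
    3: "lucy",
    4: "lilei",
    5: "jet",
    6: "shixin",
}

def analyze(text):
    """Rough approximation of the indexed terms of a text field."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def regexp_query(docs, pattern):
    """Match documents having at least one term fully matched by pattern."""
    compiled = re.compile(pattern)
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        # fullmatch models the implicit anchoring at both ends of the term.
        if any(compiled.fullmatch(term) for term in analyze(text))
    )

print(regexp_query(NICKNAMES, "l.*"))  # [3, 4]
print(regexp_query(NICKNAMES, "l"))    # []
```

The second call shows the anchoring: "l" alone matches no term, because no nickname consists of just that one letter.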