ElasticSearch Basics, Part 9: Fuzzy (match) Queries in the Query DSL

Fuzzy matching is generally used for text-type fields, i.e. fields that are indexed for full-text search.

I. Preparing the Data

First, let's create some sample data to experiment with.

1. Defining the mapping

Define a basic user index:

```json
PUT user
{
  "mappings": {
    "_doc": {
      "properties": {
        "user_id":     { "type": "long" },
        "nickname":    { "type": "text" },
        "account":     { "type": "keyword" },
        "password":    { "type": "text", "index": false },
        "email":       { "type": "text" },
        "avatar":      { "type": "text", "index": false },
        "status":      { "type": "integer" },
        "tags":        { "type": "text" },
        "address":     { "type": "text" },
        "create_time": { "type": "date" }
      }
    }
  }
}
```
2. Inserting the data:
```json
POST _bulk
{"create":{"_index":"user","_type":"_doc","_id":1}}
{"user_id":1,"nickname":"shixinke","account":"shixinke","password":"abc","email":"i@withec.net","avatar":"http://avatar.shixinke.com/images/20190410457845781.png","status":1,"tags":["技术宅","文艺"],"address":"HangZhou,ZheJiang,China","create_time":1554886662618}
{"create":{"_index":"user","_type":"_doc","_id":2}}
{"user_id":2,"nickname":"withec","account":"withec","password":"abceee","email":"withec@withec.com","avatar":"http://avatar.shixinke.com/images/20190410457145781.png","status":0,"tags":["活泼","运动型"],"address":"ShiYan,HuBei,China","create_time":1554886682618}
{"create":{"_index":"user","_type":"_doc","_id":3}}
{"user_id":3,"nickname":"lucy","account":"lucy","password":"abceee","email":"lucy@google.com","avatar":"http://avatar.google.com/images/20190410457145781.png","status":0,"tags":["安静","文艺"],"address":"JiNan,ShanDong,China","create_time":1554886782618}
{"create":{"_index":"user","_type":"_doc","_id":4}}
{"user_id":4,"nickname":"lilei","account":"lilei","password":"aebceeee","email":"lilei@live.com","avatar":"http://avatar.live.com/images/20190410457145781.png","status":1,"tags":["旅行","读书"],"address":"PuDong,ShangHai,China","create_time":1554889782618}
{"create":{"_index":"user","_type":"_doc","_id":5}}
{"user_id":5,"nickname":"jet","account":"jet","password":"aebceeee","email":"jet@sina.com","avatar":"http://avatar.sina.com/images/20190410457145781.png","status":1,"tags":["固执","文艺"],"address":"NingBo,ZheJiang,China","create_time":1554882782618}
{"create":{"_index":"user","_type":"_doc","_id":6}}
{"user_id":6,"nickname":"shixin","account":"shixin","password":"aebceeee","email":"shixinke@withec.net","avatar":"http://avatar.withec.com/images/20190410457145781.png","status":1,"tags":["天真","文艺"],"address":"SuZhou,JIangSu,China","create_time":1554882752618}
```

II. Using match

1. Basic match usage
  • Function: the search string is analyzed into terms, and each term is looked up in the target field's inverted index (the map from the field's analyzed terms to document IDs); a document matches if it contains any of those terms.
  • Applies to:
    • multi-valued text fields
    • text that can be analyzed into terms
(1) Analyzed fields
  • The address field is of type text, which is analyzed by default. We can use the _analyze API to see how an address gets tokenized:
```json
POST _analyze
{
  "text": "HangZhou,ZheJiang,China"
}
```

Note: the _analyze API uses the standard analyzer by default, which works for English; for Chinese you need to specify a Chinese-capable analyzer. The text above is split on the commas into three tokens.

```json
{
  "tokens" : [
    {
      "token" : "hangzhou",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "zhejiang",
      "start_offset" : 9,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "china",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
```
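If you need a different analyzer, _analyze accepts an analyzer parameter. A sketch for Chinese text, assuming the IK analysis plugin is installed (ik_max_word is provided by that plugin, not by Elasticsearch itself):

```json
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "浙江杭州"
}
```

The same analyzer name can be set on a text field in the mapping via the analyzer property, so the field is indexed with proper Chinese word segmentation instead of the per-character default.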

A match query therefore just analyzes the query string and compares the resulting terms with the terms of the target field:

```json
GET user/_doc/_search
{
  "_source": ["user_id", "address"],
  "query": {
    "match": {
      "address": "zhejiang"
    }
  }
}
```

This returns:

```json
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.2876821,
        "_source" : {
          "address" : "NingBo,ZheJiang,China",
          "user_id" : 5
        }
      },
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "address" : "HangZhou,ZheJiang,China",
          "user_id" : 1
        }
      }
    ]
  }
}
```
(2) Multi-valued fields

In this example, tags is a multi-valued field:

```json
GET user/_doc/_search
{
  "query": {
    "match": {
      "tags": "运"
    }
  }
}
```

Result:

```json
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9227539,
    "hits" : [
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9227539,
        "_source" : {
          "user_id" : 2,
          "nickname" : "withec",
          "account" : "withec",
          "password" : "abceee",
          "email" : "withec@withec.com",
          "avatar" : "http://avatar.shixinke.com/images/20190410457145781.png",
          "status" : 0,
          "tags" : [
            "活泼",
            "运动型"
          ],
          "address" : "ShiYan,HuBei,China",
          "create_time" : 1554886682618
        }
      }
    ]
  }
}
```

Note: because we did not specify an analyzer in the mapping, the default (standard) analyzer splits Chinese text into individual characters, which is why the single character 运 matches the tag 运动型.
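This is easy to verify with _analyze using the default analyzer:

```json
POST _analyze
{
  "text": "活泼"
}
```

With the standard analyzer, the tag 活泼 should come back as the two single-character tokens 活 and 泼.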

(3) More match parameters

The match query also supports the following parameters:

  • query : the value to match
  • operator : how the analyzed terms are combined
    • and : every term must match
    • or : any single term matching is enough (the default)
  • minimum_should_match : the minimum number of terms that must match

A. By default (or with operator set to or), a document only needs to contain one of the terms:

```json
GET user/_doc/_search
{
  "_source": ["user_id", "tags"],
  "query": {
    "match": {
      "tags": {
        "query": "运,艺"
      }
    }
  }
}
```
  • The query string "运,艺" is analyzed into the term set ["运", "艺"].
  • The tags field is analyzed with the default analyzer into its own term set; for example ["活泼", "运动型"] becomes ["活", "泼", "运", "动", "型"].
  • Matching is then just a comparison of the two term sets: tags only has to contain "运" or "艺".

So the query above returns:

```json
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 1.0126973,
    "hits" : [
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.0126973,
        "_source" : {
          "user_id" : 6,
          "tags" : [
            "天真",
            "文艺"
          ]
        }
      },
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9227539,
        "_source" : {
          "user_id" : 2,
          "tags" : [
            "活泼",
            "运动型"
          ]
        }
      },
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.2876821,
        "_source" : {
          "user_id" : 5,
          "tags" : [
            "固执",
            "文艺"
          ]
        }
      },
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "user_id" : 1,
          "tags" : [
            "技术宅",
            "文艺"
          ]
        }
      },
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.2876821,
        "_source" : {
          "user_id" : 3,
          "tags" : [
            "安静",
            "文艺"
          ]
        }
      }
    ]
  }
}
```

B. Controlling matches with operator: and

```json
GET user/_doc/_search
{
  "_source": ["user_id", "tags"],
  "query": {
    "match": {
      "tags": {
        "query": "天,艺",
        "operator": "and"
      }
    }
  }
}
```
  • The query string "天,艺" is analyzed into ["天", "艺"].
  • The tags field is analyzed with the default analyzer into its own term set.
  • Because operator is and, the tags term set must contain both terms to match.
```json
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.0253947,
    "hits" : [
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 2.0253947,
        "_source" : {
          "user_id" : 6,
          "tags" : [
            "天真",
            "文艺"
          ]
        }
      }
    ]
  }
}
```

C. Controlling matches with minimum_should_match

```json
GET user/_doc/_search
{
  "_source": ["user_id", "tags"],
  "query": {
    "match": {
      "tags": {
        "query": "天,艺,运",
        "minimum_should_match": 2
      }
    }
  }
}
```
  • The query string "天,艺,运" is analyzed into ["天", "艺", "运"].
  • The tags field is analyzed with the default analyzer into its own term set.
  • A document matches as long as its tags terms contain at least two of ["天", "艺", "运"].
```json
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.0253947,
    "hits" : [
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 2.0253947,
        "_source" : {
          "user_id" : 6,
          "tags" : [
            "天真",
            "文艺"
          ]
        }
      }
    ]
  }
}
```
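The term-set logic behind operator and minimum_should_match can be sketched in plain Python. This is only an illustration of the matching rule, not how Elasticsearch is implemented; per-character splitting stands in for what the standard analyzer does to Chinese text:

```python
def analyze(text):
    """Stand-in for the standard analyzer: split Chinese text into
    single-character tokens, dropping the comma separators."""
    return [ch for ch in text if ch != ","]

def matches(field_terms, query, operator="or", minimum_should_match=1):
    """Return True if the analyzed query matches the field's term set."""
    query_terms = analyze(query)
    hits = sum(1 for t in query_terms if t in field_terms)
    if operator == "and":
        return hits == len(query_terms)   # every term must be present
    return hits >= minimum_should_match   # or: at least N terms present

# tags ["天真", "文艺"] analyzed per character:
tags = analyze("天真文艺")

print(matches(tags, "天,艺", operator="and"))             # both terms present -> True
print(matches(tags, "运,艺"))                             # "艺" present -> True
print(matches(tags, "天,艺,运", minimum_should_match=2))  # 2 of 3 present -> True
```

The three calls mirror queries A, B, and C above: user 6's tags contain 艺 (satisfying the default or), contain both 天 and 艺 (satisfying and), and contain two of the three terms 天, 艺, 运 (satisfying minimum_should_match: 2).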
2. match_phrase
  • Function: the target field must contain the full search phrase.
```json
GET user/_doc/_search
{
  "_source": ["user_id", "tags"],
  "query": {
    "match_phrase": {
      "tags": "艺术"
    }
  }
}
```
  • This only finds documents whose tags contain the phrase "艺术" as-is. The difference from match is significant (as I understand it: the target field must contain the analyzed query terms consecutively and in order, not merely match some of them individually).

Compare with match:

```json
GET user/_doc/_search
{
  "_source": ["user_id", "tags"],
  "query": {
    "match": {
      "tags": {
        "query": "艺术",
        "operator": "and"
      }
    }
  }
}
```
  • match, by contrast, finds records whose tags contain both "艺" and "术" anywhere, not necessarily adjacent.
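match_phrase also accepts a slop parameter that relaxes the adjacency requirement, allowing the terms to be up to slop positions apart. A sketch (the query string and value here are illustrative):

```json
GET user/_doc/_search
{
  "_source": ["user_id", "tags"],
  "query": {
    "match_phrase": {
      "tags": {
        "query": "艺术",
        "slop": 1
      }
    }
  }
}
```

With slop: 0 (the default) the terms must be strictly consecutive; raising slop trades precision for recall.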
3. multi_match
  • Function: search several fields for the same (analyzed) query string.
  • Parameters:
    • query : the value to match
    • fields : the fields to search
    • type : how per-field matches are scored and combined
      • best_fields : matching any one field is enough; the score of the best-matching field is used (the default)
      • most_fields : matching any one field is enough, but the scores of all matching fields are combined
      • cross_fields : the fields are analyzed with the same analyzer and treated as one big field; matching in any of them is enough
      • phrase : runs a match_phrase query on each field and uses the best field's score
      • phrase_prefix : runs a match_phrase_prefix query on each field (the field must contain the search content as a phrase prefix) and uses the best field's score
    • operator : how the analyzed terms are combined, as in match
      • and : every term must match
      • or : any one term matching is enough
```json
GET user/_doc/_search
{
  "_source": ["user_id", "nickname", "email"],
  "query": {
    "multi_match": {
      "query": "shixinke",
      "fields": ["email", "nickname"]
    }
  }
}
```
  • This searches the email and nickname fields of the user index for the keyword "shixinke".
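The type and per-field boosting options can be combined. A sketch (the ^2 suffix is Elasticsearch's per-field boost syntax; with most_fields, the scores of all matching fields are combined, with nickname weighted twice as heavily):

```json
GET user/_doc/_search
{
  "_source": ["user_id", "nickname", "email"],
  "query": {
    "multi_match": {
      "query": "shixinke",
      "fields": ["nickname^2", "email"],
      "type": "most_fields"
    }
  }
}
```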

III. prefix

Request:

```json
GET user/_doc/_search
{
  "_source": ["user_id", "email"],
  "query": {
    "prefix": {
      "email": "shixin"
    }
  }
}
```

This matches records whose email field contains a term beginning with shixin.

Result:

```json
{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 6,
          "email" : "shixinke@withec.net"
        }
      }
    ]
  }
}
```
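Note that prefix is a term-level query: it runs against the terms in the inverted index, not the original string. email is a text field, so standard analysis breaks shixinke@withec.net into terms such as shixinke, which is what the prefix shixin matches. On a keyword field the whole value is indexed as a single term, so the same query works against the unanalyzed account field; a sketch:

```json
GET user/_doc/_search
{
  "_source": ["user_id", "account"],
  "query": {
    "prefix": {
      "account": "shixin"
    }
  }
}
```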

IV. regexp (regular expression matching)

1. Function

Search the indexed content with a regular expression.

2. Usage
  • Parameters:
    • value : the regular expression
    • flags : optional operators to enable
      • ALL : enable all optional operators
      • ANYSTRING : enable @, which matches any string
      • AUTOMATON : enable named automata, written <identifier>
      • COMPLEMENT : enable ~, which matches any string of any length except the pattern that follows it
      • EMPTY : enable #, the empty-language operator
      • INTERSECTION : enable &, which joins two patterns that must both match
      • INTERVAL : enable numeric ranges written <min-max>
```json
GET user/_doc/_search
{
  "_source": ["user_id", "nickname", "email"],
  "query": {
    "regexp": {
      "nickname": {
        "value": "l.*"
      }
    }
  }
}
```
  • l.* matches terms that start with l followed by anything (. matches any single character, and * means zero or more occurrences of it).

Result:

```json
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 4,
          "nickname" : "lilei",
          "email" : "lilei@live.com"
        }
      },
      {
        "_index" : "user",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 3,
          "nickname" : "lucy",
          "email" : "lucy@google.com"
        }
      }
    ]
  }
}
```