Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram










Suppose there is the following mapping with Edge NGram Tokenizer:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}


And the following documents are indexed:



POST /tag/tag/_bulk
{"index": {}}
{"name": "HITS FIND SOME"}
{"index": {}}
{"name": "TRENDING HI"}
{"index": {}}
{"name": "HITS OTHER"}


Then searching

{
  "query": {
    "match": {
      "name": {
        "query": "HI"
      }
    }
  }
}

yields all three documents with the same score, or TRENDING HI with a score higher than one of the others.

How can this be configured so that entries that actually start with the searched n-gram get a higher score? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI; at the same time, TRENDING HI should still be in the results.

A highlighter is also used, so the solution shouldn't break it.

The highlighter used in the query is:

"highlight": {
  "pre_tags": [
    "<"
  ],
  "post_tags": [
    ">"
  ],
  "fields": {
    "name": {}
  }
}

Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.










      elasticsearch search n-gram






asked Nov 10 at 11:44 by m3th0dman (edited Nov 12 at 16:35)






















          3 Answers









You need to understand how Elasticsearch/Lucene analyzes your data and calculates the search score.

1. Analyze API

The Analyze API (https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html) shows which tokens Elasticsearch will store. In your case, for TRENDING HI:

T / TR / TRE / ... / TRENDING / H / HI
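
To see why all three documents match, the index-time analysis can be approximated in a few lines of Python. This is only a sketch: `edge_ngrams` is a hypothetical helper that imitates an edge_ngram tokenizer with the question's settings (min_gram 1, max_gram 10, token_chars letter and symbol). It is not Elasticsearch's implementation, and the Analyze API remains the authoritative check.

```python
import unicodedata


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Rough imitation of an edge_ngram tokenizer (token_chars: letter, symbol)."""
    tokens = []
    word = ""
    # A trailing space flushes the last word without special-casing the end.
    for ch in text + " ":
        category = unicodedata.category(ch)
        if category.startswith(("L", "S")):
            word += ch  # still inside a word
            continue
        # Any other character ends the current word: emit its leading n-grams.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
        word = ""
    return tokens


docs = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]
for doc in docs:
    print(doc, "->", edge_ngrams(doc))

# Check whether every document produces the token "HI".
print(all("HI" in edge_ngrams(doc) for doc in docs))
```

Since each document yields the token HI, a plain match on HI cannot tell a leading HI from a trailing one by itself.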


2. Score

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

The bool query is often used to build complex queries for a particular use case. Use must to select the documents that match, then should to adjust their score. A common approach is to apply different analyzers to the same field (with the fields keyword in the mapping, you can analyze the same field in several ways).

3. Don't break the highlighting

According to the docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query), you can give the highlighter its own query:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "HI"
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "name": "HI"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {
        "highlight_query": {
          "match": {
            "name": "HI"
          }
        }
      }
    }
  }
}
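
As a sketch of how such a request could be assembled for an arbitrary search term, the following Python builds the same body as a dictionary. `build_search_body` is a hypothetical helper name; it only produces JSON and does not contact a cluster.

```python
import json


def build_search_body(term, field="name"):
    """Build a bool query whose highlighting is driven by a separate match query."""
    match = {"match": {field: term}}
    return {
        "query": {
            "bool": {
                # must decides which documents are returned
                "must": [match],
                # should is optional here and only contributes to the score
                "should": [{"prefix": {field: term}}],
            }
        },
        "highlight": {
            "pre_tags": ["<"],
            "post_tags": [">"],
            # highlight_query keeps the highlighter on the plain match query
            "fields": {field: {"highlight_query": match}},
        },
    }


print(json.dumps(build_search_body("HI"), indent=2))
```

Keeping the scoring clause out of highlight_query is the design choice here: the highlighter only sees the match query, whatever extra clauses are added for ranking.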












































In this particular case you could add a match_phrase_prefix clause to your query, which does a prefix match on the last term of the query text:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ]
    }
  }
}

The match clause will match all three documents, but the match_phrase_prefix clause won't match TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.

Quoting the docs:

"The match_phrase_prefix query is a poor-man’s autocomplete [...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type."

On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
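
How minimum_should_match changes the result set can be illustrated with a toy filter. The clause outcomes below are taken from the explanation above as an assumption (they are not measured against a cluster), and `is_returned` is a hypothetical helper, not an Elasticsearch API.

```python
def is_returned(clause_matches, minimum_should_match=1):
    """A document is returned when enough should clauses match it."""
    return sum(clause_matches.values()) >= minimum_should_match


# Assumed clause outcomes, following the explanation above.
outcomes = {
    "HITS FIND SOME": {"match": True, "match_phrase_prefix": True},
    "TRENDING HI": {"match": True, "match_phrase_prefix": False},
    "HITS OTHER": {"match": True, "match_phrase_prefix": True},
}

for minimum in (1, 2):
    kept = [name for name, hits in outcomes.items() if is_returned(hits, minimum)]
    print(minimum, kept)
```

Under these assumptions a value of 1 keeps all three documents, while a value of 2 would drop TRENDING HI, which is not what the question asks for.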






• But I need TRENDING HI as a result; just with a lower score.
  – m3th0dman
  Nov 11 at 10:54

• @m3th0dman the overall results are a combination of matching results for each term, so TRENDING HI will appear in the results, and it will appear with a lower score. Edited the answer to make this clearer.
  – AdrienF
  Nov 11 at 14:04

• Thank you for your answer!
  – m3th0dman
  Nov 12 at 11:38

• Unfortunately this messes up the highlighter.
  – m3th0dman
  Nov 12 at 14:12

• @m3th0dman that's a new element. Could you give some more details on how you're doing the highlighting, and what you mean exactly by it being "messed up"?
  – AdrienF
  Nov 12 at 14:36

































One possible solution is to use multi-fields. They let you index the same value from your source document in different ways. In your case you could index the name field as default text, as n-grams, and as edge n-grams. The query would then be a bool query that matches against all of those fields.

The final score of a document is made up of the scores of its matching clauses. Those matches are also called signals, since each one signals a match between the query and the document. The document with the most matching signals gets the highest score.

In your case all documents would match the n-gram HI, but only the HITS FIND SOME and HITS OTHER documents would get the additional edge n-gram score. This would give those two documents a boost and put them on top. The catch is that you have to make sure the edge n-gram tokenizer doesn't split on whitespace, because otherwise an HI at the end of a name would score the same as an HI at the beginning.

Here is an example mapping and query for your case:



PUT /tag/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_analyzer": {
          "tokenizer": "edge_tokenizer"
        },
        "kw_analyzer": {
          "tokenizer": "kw_tokenizer"
        },
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        },
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "kw_tokenizer": {
          "type": "keyword"
        },
        "edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "fields": {
            "edge": {
              "type": "text",
              "analyzer": "edge_analyzer"
            },
            "ngram": {
              "type": "text",
              "analyzer": "ngram_analyzer"
            }
          }
        }
      }
    }
  }
}



And a query:

POST /tag/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "match": {
                "name.edge": {
                  "query": "HI"
                }
              }
            },
            "boost": "5",
            "boost_mode": "multiply"
          }
        },
        {
          "match": {
            "name.ngram": {
              "query": "HI"
            }
          }
        },
        {
          "match": {
            "name": {
              "query": "HI"
            }
          }
        }
      ]
    }
  }
}
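
The signal idea can be sketched with a toy model in Python. Everything here is an assumption made for illustration: `signals`, `inner_ngrams`, `toy_score` and the `WEIGHTS` table are hypothetical, the analyzers are only approximated, and real relevance scores are computed by Lucene and will differ. The model only shows which sub-fields would match and how a boosted edge n-gram match moves documents up.

```python
# The weight of 5 mirrors the boost in the query; the other weights are arbitrary.
WEIGHTS = {"name.edge": 5, "name.ngram": 1, "name": 1}


def inner_ngrams(word, min_gram=2, max_gram=10):
    """All substrings of `word` whose length is between min_gram and max_gram."""
    return {
        word[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(word) - n + 1)
    }


def signals(name, query):
    """Which of the three sub-fields would match the query in this toy model."""
    words = name.split()
    # Edge n-grams of the whole string: whitespace does not start a new token.
    edge = {name[:n] for n in range(2, min(10, len(name)) + 1)}
    ngram = set().union(*(inner_ngrams(w) for w in words))
    # The default analyzer indexes whole, lowercased words.
    plain = {w.lower() for w in words}
    return {
        "name.edge": query in edge,
        "name.ngram": query in ngram,
        "name": query.lower() in plain,
    }


def toy_score(name, query):
    """Sum the weights of all matching signals."""
    return sum(WEIGHTS[field] for field, hit in signals(name, query).items() if hit)


docs = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]
for doc in sorted(docs, key=lambda d: toy_score(d, "HI"), reverse=True):
    print(toy_score(doc, "HI"), doc)
```

In this model TRENDING HI still collects signals from name.ngram and name, so it stays in the results, but it lacks the boosted name.edge signal that the two HITS documents receive.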






















The first answer above (bool query with a separate highlight_query) was answered Nov 12 at 16:54 by Thomas Decaux and edited Nov 12 at 17:32.























                      edited Nov 11 at 14:02

























                      answered Nov 10 at 14:27









                      AdrienF












                      • But I need TRENDING HI as a result; just with a lower score.
                        – m3th0dman
                        Nov 11 at 10:54






                      • @m3th0dman the overall results are a combination of matching results for each term, so TRENDING HI will appear in the results, and it will appear with a lower score. Edited the answer to make this clearer.
                        – AdrienF
                        Nov 11 at 14:04










                      • Thank you for your answer!
                        – m3th0dman
                        Nov 12 at 11:38











                      • Unfortunately this messes up the highlighter.
                        – m3th0dman
                        Nov 12 at 14:12










                      • @m3th0dman that's a new element. Could you give some more details on how you're doing the highlighting, and what you mean exactly by it being "messed up"?
                        – AdrienF
                        Nov 12 at 14:36































A possible solution for this problem is to use multi-fields. They allow the same data from your source document to be indexed in different ways. In your case you could index the name field as default text, then as n-grams and also as edge n-grams. The query would then have to be a bool query that matches against all of those fields.

The final score of a document is composed of the scores of its matching clauses. Those matches are also called signals, signalling that there is a match between the query and the document. The document with the most matching signals gets the highest score.

In your case all documents would match the n-gram HI. But only the HITS FIND SOME and HITS OTHER documents would get the additional edge n-gram score. This would give those two documents a boost and put them on top. The complication is that you have to make sure that the edge n-gram tokenizer doesn't split on whitespace, because otherwise the HI at the end would get the same score as an HI at the beginning of the document.
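To see why the whitespace handling matters, here is a minimal Python sketch that imitates an edge n-gram tokenizer. It is only an illustration, not Elasticsearch's implementation: the function `edge_ngrams` and its `split_on_whitespace` flag are made up for this example, and the real `token_chars` character classes are reduced to plain whitespace splitting.

```python
def edge_ngrams(text, min_gram, max_gram, split_on_whitespace):
    """Return the edge n-grams a simplified edge_ngram tokenizer would emit.

    Hypothetical helper for illustration only. With split_on_whitespace=True
    every word is n-grammed from its own start (roughly what token_chars
    such as "letter" cause). With False the whole text is treated as a
    single token, so only the start of the text produces n-grams.
    """
    words = text.split() if split_on_whitespace else [text]
    grams = []
    for word in words:
        longest = min(max_gram, len(word))
        for size in range(min_gram, longest + 1):
            grams.append(word[:size])
    return grams


docs = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]

# Splitting on whitespace: every document contains the token "HI".
per_word = [d for d in docs if "HI" in edge_ngrams(d, 1, 10, True)]

# No splitting: only documents that start with "HI" contain the token.
whole_text = [d for d in docs if "HI" in edge_ngrams(d, 2, 10, False)]

print(per_word)
print(whole_text)
```

In the second case TRENDING HI only produces n-grams such as "TR" and "TRENDING H", so a field analyzed that way can serve as the "starts with" signal.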



Here is an example mapping and query for your case:

    PUT /tag/
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "edge_analyzer": {
              "tokenizer": "edge_tokenizer"
            },
            "kw_analyzer": {
              "tokenizer": "kw_tokenizer"
            },
            "ngram_analyzer": {
              "tokenizer": "ngram_tokenizer"
            },
            "autocomplete_analyzer": {
              "tokenizer": "autocomplete_tokenizer",
              "filter": [
                "standard"
              ]
            },
            "autocomplete_search": {
              "tokenizer": "whitespace"
            }
          },
          "tokenizer": {
            "kw_tokenizer": {
              "type": "keyword"
            },
            "edge_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 10
            },
            "ngram_tokenizer": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 10,
              "token_chars": [
                "letter",
                "digit"
              ]
            },
            "autocomplete_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 10,
              "token_chars": [
                "letter",
                "symbol"
              ]
            }
          }
        }
      },
      "mappings": {
        "tag": {
          "properties": {
            "id": {
              "type": "long"
            },
            "name": {
              "type": "text",
              "fields": {
                "edge": {
                  "type": "text",
                  "analyzer": "edge_analyzer"
                },
                "ngram": {
                  "type": "text",
                  "analyzer": "ngram_analyzer"
                }
              }
            }
          }
        }
      }
    }









And a query:

    POST /tag/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "function_score": {
                "query": {
                  "match": {
                    "name.edge": {
                      "query": "HI"
                    }
                  }
                },
                "boost": "5",
                "boost_mode": "multiply"
              }
            },
            {
              "match": {
                "name.ngram": {
                  "query": "HI"
                }
              }
            },
            {
              "match": {
                "name": {
                  "query": "HI"
                }
              }
            }
          ]
        }
      }
    }
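To make the effect of the boost concrete, here is a small Python sketch of how the should clauses could add up. It is not Elasticsearch's scoring code: the raw clause scores of 1.0 and the choice of which clauses match which document are invented for illustration, and only the boost of 5 is taken from the query.

```python
EDGE_BOOST = 5.0  # the "boost" of the function_score clause in the query


def combined_score(edge=None, ngram=None, name=None):
    """Add up the scores of the matching should clauses.

    Each argument is the raw score of one clause, or None when that
    clause does not match the document. A bool query adds the scores of
    its matching clauses; the edge clause is multiplied by its boost first.
    """
    total = 0.0
    if edge is not None:
        total += EDGE_BOOST * edge
    if ngram is not None:
        total += ngram
    if name is not None:
        total += name
    return total


# Hypothetical raw scores: every matching clause contributes 1.0.
starts_with_hi = combined_score(edge=1.0, ngram=1.0)  # e.g. HITS OTHER
only_contains_hi = combined_score(ngram=1.0)          # e.g. TRENDING HI

print(starts_with_hi, only_contains_hi)
```

Whatever the real clause scores turn out to be, the boosted edge clause is meant to dominate, so names that start with the typed text should rank above names that merely contain it.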
















                          edited Nov 19 at 13:53

























                          answered Nov 19 at 13:18









                          paweloque



























