Scrape webpage for dataset











I am trying to scrape the contents of a webpage so I can check whether a
Stata dataset exists.



I have put together a few lines of code but they don't work:



tempfile page
copy "https://www.stata-press.com/data/r15/u.html" "`page'"
tempname fh
file open `fh' using "`page'", read
file read `fh' line
while r(eof)==0 {
    if "`line'"=="regsmpl.dta" dis "Dataset exists"
    else dis "Dataset doesn't exist"
    file read `fh' line
}
file close `fh'


Any ideas will be highly appreciated.










Tags: stata

asked Nov 9 at 18:35 by user10630389; edited Nov 9 at 18:51 by Pearly Spencer

1 Answer

You could first read the entire page into a string scalar using the fileread() function:

local dataset regsmpl
scalar page = fileread("https://www.stata-press.com/data/r15/u.html")

Once the scalar has been created, there are two ways to proceed.
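Whether the read actually succeeded can be checked before searching the scalar. As documented for fileread(), a failed read does not stop with an error; the function instead returns the text "fileread() error #", where # is a return code. A minimal sketch of such a guard follows; the display messages are only illustrative, and scalar() is used so that a variable that happens to be called page cannot be picked up by mistake:

```stata
* Sketch only: guard against a failed download before searching the page.
scalar page = fileread("https://www.stata-press.com/data/r15/u.html")

* On failure fileread() returns the text "fileread() error #" instead of
* the file contents, so test for that prefix.
if strpos(scalar(page), "fileread() error") == 1 {
    display as error "Could not read the page: " scalar(page)
}
else {
    display "Page read: " strlen(scalar(page)) " bytes"
}
```

If the error branch is reached, the searches that follow would otherwise run against the error text and report that the dataset is missing.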



Solution 1: Check if the dataset is mentioned in the page

if strmatch(page, "*`dataset'.dta*") display "Page mentions dataset"
else display "No trace of dataset in page"
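The wildcards are what make this work. Without them strmatch() has to match the whole string, just like the == comparison in the question, which is why that code never reports a match on a line of HTML. A small sketch on a made-up line (the HTML below is invented for illustration and is not taken from the page):

```stata
* A made-up line of HTML, similar to what -file read- might return.
local line `"<li><a href="regsmpl.dta">regsmpl.dta</a></li>"'

* Exact equality fails because the line holds more than the file name.
display (`"`line'"' == "regsmpl.dta")

* Without wildcards strmatch() also has to match the whole string.
display strmatch(`"`line'"', "regsmpl.dta")

* With leading and trailing wildcards it matches anywhere in the line.
display strmatch(`"`line'"', "*regsmpl.dta*")

* strpos() gives the same answer without wildcards: a position above 0.
display (strpos(`"`line'"', "regsmpl.dta") > 0)
```

Run as written, the first two commands should display 0 and the last two 1.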


Solution 2: Check if there is an actual link pointing to the dataset

local link = ustrregexm(page, `"<a [^>]*\bhref\s*=\s*"([^"]*`dataset'.dta[^"]*)"')
local url = trim(ustrregexs(1))

if "`url'" != "" display "The link is: `url'"
else display "There is no such link"
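To see what the pattern captures, it can be tried on a single anchor tag first. The sketch below uses the pattern from Solution 2, written with \b (word boundary) and \s (whitespace), on a made-up tag; the only variation is that the dot before dta is escaped so that it matches a literal dot rather than any character. The sample HTML and the display messages are assumptions for illustration:

```stata
local dataset regsmpl

* Made-up anchor tag standing in for part of the page.
local html `"<a class="data" href="regsmpl.dta">regsmpl.dta</a>"'

* Pattern from Solution 2, with the dot before dta escaped.
local found = ustrregexm(`"`html'"', `"<a [^>]*\bhref\s*=\s*"([^"]*`dataset'\.dta[^"]*)"')

* ustrregexs(1) holds the text captured by the parentheses, i.e. the href.
if `found' == 1 display "Captured href: " ustrregexs(1)
else display "No link matched"
```

Because the capture group is everything between the quotes after href=, the result is the link target itself, whether it is a relative file name or a full URL.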


Your line-by-line approach can also work, using either strmatch() or a regular expression:

tempname fh
file open `fh' using "https://www.stata-press.com/data/r15/u.html", read
file read `fh' line

* Flag the dataset as found if any line contains its file name
local tag = 0
while r(eof) == 0 {
    if strmatch(`"`line'"', "*regsmpl.dta*") local tag = 1
    file read `fh' line
}
file close `fh'

if `tag' == 1 display "Dataset exists"
else display "Dataset doesn't exist"




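The second version applies the regular expression to each line instead (it uses the local macro dataset defined earlier):

tempname fh
file open `fh' using "https://www.stata-press.com/data/r15/u.html", read
file read `fh' line

* Keep the most recent matching link and flag that one was found
local tag = 0
while r(eof) == 0 {
    local link = ustrregexm(`"`line'"', `"<a [^>]*\bhref\s*=\s*"([^"]*`dataset'.dta[^"]*)"')
    if `link' == 1 {
        local url = trim(ustrregexs(1))
        local tag = 1
    }
    file read `fh' line
}
file close `fh'

if `tag' == 1 display "The link is: `url'"
else display "There is no such link"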





answered Nov 9 at 18:49 by Pearly Spencer; edited Nov 11 at 0:04