Web Scraping, Part 2¶
You have installed BeautifulSoup (bs4) and tried some basic scraping.
If you have not yet installed the Requests module, do it now (in your virtual environment).
If you have not made a virtual environment yet, see these instructions.
The code for this chapter is here.
Using select() instead of find() or find_all()¶
In the previous section we covered several commonly used commands for scraping with BeautifulSoup:

soup.h1.text
soup.find_all("td", class_="city")
soup.find_all("img")
soup.find(id="call")
In chapter 12 of Automate the Boring Stuff with Python (2nd edition), the author covers another command, the select() method. More info: Read the docs.
This method might hold special appeal to people used to working with JavaScript, because the syntax for targeting HTML elements — inside the parentheses of select() — follows the same syntax as this commonly used JavaScript method:

document.querySelectorAll()
So instead of ("td", class_="city"), we would write ("td.city"), and instead of (id="call"), we would write ("#call").
Note that select() always returns a list, even when only one item is found.
Be mindful that the way you write out what you're looking for depends on whether you are calling select() or you are calling find() or find_all(). You'll get errors if you mix up the syntax.
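Here is a minimal sketch that shows the two syntaxes side by side. The tiny HTML snippet is made up for the demonstration — it is not from the example site.

from bs4 import BeautifulSoup

# a made-up snippet, just to compare the two syntaxes
html = '<table><tr><td class="city" id="call">Tokyo</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# find()/find_all() use keyword arguments
print(soup.find_all('td', class_='city'))  # [<td class="city" id="call">Tokyo</td>]
print(soup.find(id='call'))                # <td class="city" id="call">Tokyo</td>

# select() uses CSS selector strings and always returns a list
print(soup.select('td.city'))              # [<td class="city" id="call">Tokyo</td>]
print(soup.select('#call')[0].text)        # Tokyo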
Working with lists of Tag objects¶
Both find_all() and select() always return a Python list. Each item in the list is a BeautifulSoup Tag object. You can access any list item using its index — just as you would with any normal Python list.
Try this code in the Python shell:

from bs4 import BeautifulSoup
import requests

url = "https://weimergeeks.com/examples/scraping/example1.html"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
images = soup.select('img')
print(images)
You'll see that you have a Python list of IMG elements.
You can call .get_text() or .text on a Tag object to get just the text inside the element. To get the text from just one Tag object in a list, use its list index:

cities = soup.select('td.city')
print(cities[0])
print(cities[0].text)
To get the text from all the items in the list, you need a for-loop:

for city in cities:
    print(city.text)
If an element has attributes, you can get a Python dictionary containing all of them — again, use an index to see just one item from the list:

images = soup.select('img')
print(images[0].attrs)

Example, running the commands above in the Python shell:

>>> print(images[0])
<img alt="thumbnail" src="images/thumbnails/park_structures.jpg"/>
>>> print(images[0].attrs)
{'src': 'images/thumbnails/park_structures.jpg', 'alt': 'thumbnail'}
To get a particular attribute for all the IMG elements, you need a for-loop:
for image in images:
    print(image.attrs['src'])

Again, here's how that would run in the Python shell:

>>> for image in images:
...     print(image.attrs['src'])
...
images/thumbnails/park_structures.jpg
images/thumbnails/building.jpg
images/thumbnails/mosque.jpg
images/thumbnails/turrets.jpg
images/thumbnails/russia.jpg
>>>
View the example web page to get a clear idea of where those attributes came from:
https://weimergeeks.com/examples/scraping/example1.html
Another way to get a particular attribute from a Tag object is with .get():

for image in images:
    print(image.get('src'))

As you see, there are various ways to do the same thing with BeautifulSoup. If you find it confusing, choose one way and stick with it.
When in doubt, refer to the BeautifulSoup documentation — it's all on one page, so search it with Control-F.
Finding inside a Tag object¶
The methods find(), find_all(), and select() work on Tag objects as well as BeautifulSoup objects (types of objects are covered here). Here is an example:
../python_code_examples/scraping/table_scrape.py¶
 1  from bs4 import BeautifulSoup
 2  import requests
 3
 4  url = "https://en.wikipedia.org/wiki/List_of_Scottish_monarchs"
 5  page = requests.get(url)
 6  soup = BeautifulSoup(page.text, 'html.parser')
 7
 8  # get the first table in the article
 9  table = soup.find('table', class_='wikitable')
10
11  # get a list of all rows in that table
12  rows = table.find_all('tr')
13
14  # loop over all rows, get all cells
15  for row in rows:
16      try:
17          cells = row.find_all('td')
18          # print contents of the second cell in the row
19          print(cells[1].text)
20      except:
21          pass
Once we've got table out of soup (line 9 above), we can proceed to find elements inside the Tag object table. First we get a list of all rows (line 12). Then we can loop over the list of row objects (starting on line 15) and make a list of all table cells in each row (line 17). From that list, we can extract the contents of one or more cells. In the for-loop, by printing cells[1].text (line 19), we will see a list of all Scottish monarchs in the first table on the page.
It's as if we are taking apart a set of nested boxes. We get inside the table to get the rows. We get inside a row to get its cells.
Note
Using try/except in the script above enables us to skip over the header row of the table, where the HTML tags are th instead of td.
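If you would rather not rely on a bare except, an equivalent approach is to check how many td cells a row actually has — on a header row, find_all('td') simply returns an empty list. This is a sketch of an alternative, not the author's original code:

for row in rows:
    cells = row.find_all('td')
    # header rows contain <th> cells, so this list is empty for them
    if len(cells) > 1:
        print(cells[1].text)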
Since the Scottish monarchs page has multiple tables, the code above should be modified to get them all:

tables = soup.find_all('table', class_='wikitable')

Then we will need to loop through the tables:

for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        try:
            cells = row.find_all('td')
            # print contents of the second cell in the row
            print(cells[1].text)
        except:
            pass

Our set of nested boxes actually begins with the page. Within the page are several tables. Inside each table, we find rows, and within each row, we find cells. Within the second cell in each row, we find the name of a monarch.
Note
The revised script will not work perfectly on the Scottish monarchs page because the tables on that page are not formatted consistently. The first table on the page has the monarch's name in the second column, but the other tables have it in the first column.
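One way to cope with that inconsistency is to treat the first table differently from the rest. The sketch below assumes the name is in the second cell of the first table and in the first cell of every other table, as described in the note — verify that by inspecting the page before relying on it:

for index, table in enumerate(tables):
    # assumption: second cell in the first table, first cell elsewhere
    cell_number = 1 if index == 0 else 0
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > cell_number:
            print(cells[cell_number].text)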
Moving from page to page while scraping¶
In chapter 12 of Automate the Boring Stuff with Python (2nd edition), Sweigart provides a script to scrape the XKCD comics website ("Project: Downloading All XKCD Comics"). The code in steps 3 and 4, which are part of a longer while-loop, gets the URL from an element on the page that links to the previous comic. In this way, the script starts on the home page of the site, downloads one comic, then moves to the previous day's comic page and downloads the comic there. The script repeats this, moving to the previous page each time, until all comics have been downloaded.
Note
This method is often exactly what you need to scrape the data that you want.
The trick is to determine exactly how to get the URL that leads to the next page to be scraped.
In the case of the XKCD site, this code works:

prevLink = soup.select('a[rel="prev"]')[0]
url = 'https://xkcd.com' + prevLink.get('href')
The code select('a[rel="prev"]') gets all a elements on the page that contain the attribute rel with the value "prev" — that is, rel="prev". This code returns a list, so it's necessary to use the list index [0] to get the first list item.
The next line extracts the value of the href attribute from that A element and concatenates it with the base URL, https://xkcd.com.
If you inspect the HTML on any XKCD page with Developer Tools, you can find this A element.
To understand this code better, you can run it in the Python shell. Here I have started on the page at https://xkcd.com/2260/ —
>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = 'https://xkcd.com/2260/'
>>> page = requests.get(url)
>>> soup = BeautifulSoup(page.text, 'html.parser')
Note
I am not starting at the home page and I am not looping, because I want to provide a simple demonstration of what the code is getting for us.
Then I continued with the code to get only the "Prev" link from that one page:

>>> prevLink = soup.select('a[rel="prev"]')[0]
>>> print(prevLink)
<a accesskey="p" href="/2259/" rel="prev">< Prev</a>
>>> print(prevLink.get('href'))
/2259/
>>> url = 'https://xkcd.com' + prevLink.get('href')
>>> print(url)
https://xkcd.com/2259/
>>>
Above, I have printed three things — prevLink, prevLink.get('href'), and url — so I can see exactly what is being extracted and used.
In the complete script in chapter 12, once the script has that final URL, the while-loop restarts. It opens that page — https://xkcd.com/2259/ — and downloads the comic from it.
This practice of printing the value each time is a way of testing your code as you go — to make sure you are getting what you intend to get. If you have an error, then you must change the line and print it again, and repeat until that line of code gets what you want.
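Putting the pieces together, here is a minimal sketch — not Sweigart's full script — of a loop that follows the rel="prev" link a few times and prints each URL it visits. The count of three pages is arbitrary, just to keep the demonstration short:

from bs4 import BeautifulSoup
import requests

url = 'https://xkcd.com/2260/'
for _ in range(3):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    prev_links = soup.select('a[rel="prev"]')
    if not prev_links:
        break  # no previous page, so stop
    url = 'https://xkcd.com' + prev_links[0].get('href')
    print(url)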
Important
You must understand that every website is different, so probably no other website in the world has the same HTML as the XKCD website. However, many websites do have Previous and Next buttons. It is necessary to inspect the HTML and determine how to extract the next- or previous-page URL (or partial URL) from the HTML on the button.
Some websites use JavaScript to activate their Previous and Next buttons. In those cases, you will need to use the Selenium module to navigate while scraping. Selenium is covered in the next chapter.
Moving from page to page while scraping, Part 2¶
I have posted two example scripts to help you with sites where moving from page to page involves something like a row of numbered page-navigation links.
Note
Ordinarily you do NOT need Selenium for these links. They are a lot like the XKCD case discussed above.
The two scripts are:
- mls_pages.py — This one uses the "Go to next page" link until there is no next page.
- mls_pages_v2 — This one uses for i in range(), for use if you know how many pages there are; a sketch of that general pattern follows below.
Note that the two scripts do the same thing on one particular website, the Players section of the Major League Soccer site. The difference is in the way each script gets the link to the next page.
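For reference, the general shape of the for i in range() approach looks something like this. The URL pattern here (a page number in a query string on example.com) is hypothetical — it is not the real MLS structure, so substitute whatever pattern you find by inspecting your own site:

from bs4 import BeautifulSoup
import requests

# hypothetical pattern: page number appended to a query string
base = 'https://example.com/players?page='

for i in range(1, 6):  # pages 1 through 5 - use the real page count
    page = requests.get(base + str(i))
    soup = BeautifulSoup(page.text, 'html.parser')
    # scrape whatever you need from each page; here, just the page title
    print(soup.title.text)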
Important
Remember that every website is different, so probably no other website in the world has the same HTML as the MLS website. However, many websites use a similar set of links to pages.
Harvesting multiple URLs from one page¶
In some cases, you will want to get all the URLs from one page and save them in a file. You would then use another script to open them one by one and scrape multiple pages.
../python_code_examples/scraping/harvest_urls.py¶
 1  """scrape all the URLs from the article segment of a
 2  Wikipedia page and write them into a plain-text file
 3  - gets all links - even those that begin with #
 4  """
 5
 6  from bs4 import BeautifulSoup
 7  import requests
 8
 9  # get the contents of one page
10  start = 'http://en.wikipedia.org/wiki/Harrison_Ford'
11  page = requests.get(start)
12  soup = BeautifulSoup(page.text, 'html.parser')
13
14  # name the text file that will be created or overwritten
15  filename = 'myfile.txt'
16
17  def capture_urls(filename, soup):
18      """harvest the URLs and write them to file"""
19      # create and open the file for writing - note, with 'w' this will
20      # delete all contents of the file if it already exists
21      myfile = open(filename, 'w')
22
23      # get all contents of only the article
24      article = soup.find(id='mw-content-text')
25
26      # get all <a> elements
27      links_list = article.find_all('a')
28
29      # get contents of all href='' attributes with a loop
30      for link in links_list:
31          if 'href' in link.attrs:
32              # write one href into the text file - '\n' is newline
33              myfile.write(link.attrs['href'] + '\n')
34
35      # close and save the file after loop ends
36      myfile.close()
37
38  # call the function
39  capture_urls(filename, soup)
It is likely that you do not want header and footer links from the page. You need to inspect the HTML and determine what element holds the main text. For a Wikipedia article, there's an id attribute with the value 'mw-content-text', so that's what we start with in line 24.
When we get all the a elements with links_list = article.find_all('a') (line 27), we are getting only the a elements that are inside the DIV element with id='mw-content-text' — because the variable article here is a Tag object containing that entire DIV.
Then we use a loop (lines 30–33) to look at each item in links_list. We check if an href attribute exists in the item with this line — if 'href' in link.attrs: — and if there is an HREF, then we write the value of that HREF into the file (line 33).
The script above writes more than 1,400 partial URLs into a file.
As with the XKCD script in the previous section, here we would also concatenate a base URL with the partial URL in a scraping script:

base_url = 'https://en.wikipedia.org'
url = base_url + '/wiki/Blade_Runner'
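A second script would read the harvested partial URLs back out of the text file and rebuild the full URLs. Here is a sketch of that step, assuming the file is the myfile.txt written by harvest_urls.py above:

base_url = 'https://en.wikipedia.org'

with open('myfile.txt') as myfile:
    # splitlines() strips the trailing '\n' from each line
    partial_urls = myfile.read().splitlines()

full_urls = [base_url + partial for partial in partial_urls]
print(full_urls[0])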
It should be noted that the harvest_urls.py script collects a lot of partial URLs we would never want, such as internal anchors that link to parts of the same page — #cite_note-1 and #Early_career, for example. To prevent those from being written to the file, we could use:

for link in links_list:
    if 'href' in link.attrs:
        # eliminate internal anchor links
        if link.attrs['href'][:6] == '/wiki/':
            # eliminate Wikipedia photo and template links
            if link.attrs['href'][6:11] != 'File:' and link.attrs['href'][6:14] != 'Template':
                # write one href into the text file - \n is newline
                myfile.write(link.attrs['href'] + '\n')
That is a bit clunky, but if you look up how to slice strings with Python (Control-F search there for "Slice indices"), I think the code will make sense to you.
As a result, we have about 900 partial URLs instead of more than 1,400.
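If the slice notation feels awkward, string methods can do the same filtering. Here is an equivalent sketch using startswith(), which accepts either one prefix or a tuple of prefixes — it is a rewrite for readability, not the author's original code:

for link in links_list:
    if 'href' in link.attrs:
        href = link.attrs['href']
        # keep only article links, skipping File: and Template pages
        if href.startswith('/wiki/') and not href.startswith(('/wiki/File:', '/wiki/Template')):
            myfile.write(href + '\n')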
Important
You will always need to inspect the HTML of a page to figure out how best to harvest URLs from that particular page.
Scrape multiple pages with one script¶
This example shows how you can scrape multiple items from multiple pages, not using a Previous and Next button but (instead) using a collected list of partial URLs.
../python_code_examples/scraping/scrape_several_pages.py¶
 1  """scrape a heading and part of a paragraph
 2  from multiple pages,
 3  using a list of partial URLs
 4  """
 5
 6  from bs4 import BeautifulSoup
 7  import requests
 8
 9  # links from http://en.wikipedia.org/wiki/Harrison_Ford
10  link_list = ['/wiki/Melissa_Mathison',
11      '/wiki/Calista_Flockhart',
12      '/wiki/Han_Solo',
13      '/wiki/Star_Wars_Trilogy',
14      '/wiki/Indiana_Jones',
15      '/wiki/Air_Force_One_(film)',
16      '/wiki/Blade_Runner',
17      '/wiki/Carrie_Fisher']
18
19  # harvest data from each URL
20  def get_info(page_url):
21      page = requests.get('https://en.wikipedia.org' + page_url)
22      soup = BeautifulSoup(page.text, 'html.parser')
23      try:
24          print(soup.h1.get_text())
25          # get all paragraphs in the main article
26          paragraphs = soup.find(id='mw-content-text').find_all('p')
27          for p in paragraphs:
28              # skip any paragraph that has attributes
29              if not p.attrs:
30                  # print 280 characters from the first real paragraph on the page
31                  print(p.text[0:280])
32                  break
33          print()  # blank line
34      except:
35          print(page_url + ' is missing something!')
36
37  # call the function for each URL in the list
38  for link in link_list:
39      get_info(link)
We are just printing the H1 and the paragraph (rather than saving them to a file or a database) for the sake of simplicity. We are using a list of only eight partial URLs for the same reason; normally you would probably have a longer list of pages to scrape.
The key is to write a function that scrapes all the data you want from one page (lines 19–35 above). Then call that function inside a for-loop that feeds each URL into the function (lines 38–39).
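If you wanted to keep the results instead of printing them, the same pattern works with a file. Here is a hedged sketch that writes each page's H1 and URL to a CSV — the filename, the columns, and the two-item list are my own choices for the demonstration, not part of the original script:

import csv

from bs4 import BeautifulSoup
import requests

def save_info(page_url, writer):
    """scrape one page and write one row to an open CSV writer"""
    page = requests.get('https://en.wikipedia.org' + page_url)
    soup = BeautifulSoup(page.text, 'html.parser')
    writer.writerow([soup.h1.get_text(), 'https://en.wikipedia.org' + page_url])

with open('articles.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['heading', 'url'])  # header row
    for link in ['/wiki/Blade_Runner', '/wiki/Carrie_Fisher']:
        save_info(link, writer)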
Moving onward¶
In the next chapter, we'll look at how to handle more complex scraping situations with BeautifulSoup, Requests, and Selenium.
Source: https://python-adv-web-apps.readthedocs.io/en/latest/scraping2.html