In addition to local data, large quantities of data are generated over the Internet every day, and much of that content is highly valuable. You might wonder whether it is troublesome to acquire such network data. Actually, no: Python provides many functions for conveniently acquiring network data. However, the content of some web pages may be dynamically generated, say, by JavaScript. In that case, the source code of the web page does not correspond to the content displayed on the page, like the data and information on Dow Jones Industrial Average stocks on Yahoo Finance. Data on that site change frequently.

In Python, how can we acquire network data? Acquiring network data is also known as "crawling", and it is divided into two stages: Stage One is scraping and Stage Two is parsing. For scraping, we used to adopt the built-in module urllib, especially its request submodule, which can conveniently scrape web page contents. This module has gradually been replaced by the Requests third-party library, which we'll come back to later. It is suitable for developing small and medium-sized web crawlers. For developing large web crawlers, Scrapy, a highly popular open-source crawler framework, is often used.

Let's look at a concept here. The process of scraping is actually this: the client computer sends a request to the server, and the server returns a response. After getting the response, we parse it. At present, popular parsing tools include the Beautiful Soup library and the regular expression module; we'll briefly introduce them later. Apart from these scraping and parsing approaches, sometimes we may turn to a third-party API to scrape and parse web page contents more conveniently. In that case, the third-party API has already done the scraping and parsing for us, with regular updates.

Let's look at the popular Requests library for web page scraping first. Compared with many earlier libraries, this one is simple, convenient and human-oriented. Its basic method is get(), corresponding to the GET method in the HTTP protocol, which requests and acquires the resource at a designated URL. Well, let's look at its official website first. This is the official website of the Requests library. It's worth noting that we should get used to using officially provided help information. For example, under the Python environment, we would often use dir() and help(). As for third-party libraries like Requests, their official websites generally provide quite rich information, in multiple languages, so relying on the website is usually enough. Well, let's have a look. This is a brief introduction to the Requests library: "Requests is an elegant and simple HTTP library for Python, built for human beings." Interesting, isn't it? Let's feel the charm of the Requests library. Quite easy: use the get() method with the address of the web page to be scraped, and a response object is acquired. Next, view the status code with status_code; if it is 200, everything is OK. Then view the web page contents through the text attribute. These statements are the most common ones. Easy, right?
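A minimal sketch of the pattern just described; the URL here is only a placeholder, and the Requests library must be installed:

```python
import requests

# Send an HTTP GET request to a placeholder URL.
r = requests.get("https://www.python.org")

print(r.status_code)   # 200 means everything is OK
print(r.text[:200])    # the beginning of the page source
```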
Let's give it a go. Suppose we want to scrape the short comments on a certain page of The Little Prince at book.douban.com; what we are going to scrape is the source code of the web page. Not difficult: through the get() method in the Requests library, we can scrape it (add a headers argument after the website was updated; see the slides or programs). It's worth mentioning that before scraping pages of a website, check whether the website has a crawler protocol. For instance, the crawler protocol of the Douban website looks like this. How do we view a crawler protocol? Some websites provide a file named robots.txt; if so, the website has its own crawler protocol. For example, let's have a look: we are scraping the directory "subject", which is not prohibited here, so we can scrape it. Of course, if you want to scrape several pages, pay attention to the crawl delay here; say, it's 5 seconds.

OK, let's have a try. The Requests library is preloaded in Anaconda, so we can import it directly. If it is not installed, do not worry: install it through "pip install". Then use its get() method with the URL of the requested page (adding the headers argument). It's scraping now. Alright, check the status code: 200, everything's OK. That's the scraped page content. Quite easy, right?

However, content on some web pages is dynamically generated, so the directly scraped page may not include the data we want. For instance, we'd like to scrape the basic information of the Dow Jones Industrial Average stocks. In the source code, as we can see, much of the data we want cannot be acquired by directly scraping this URL; it is dynamically generated by JavaScript. Of course, website generation modes are not always the same, and some data can be acquired quite easily. We can look for non-dynamically generated pages to scrape, or acquire the data with an API, which we'll talk about later. There are also existing data sets, or data saved directly online. To conveniently acquire the basic information of the Dow Jones Industrial Average stocks, like the company code, long name, closing price and so on, we can use such a website, similar to CNN. Of course, please note that the way this website generates its data may also change.

With the URL, we can use the get() method in the Requests library to send a request and acquire a response object. This object includes the request information and the server's response information, and Requests will automatically decode the information from the server. Suppose a page is in JSON format; then we can use the built-in JSON decoder in the Requests library, like this. Suppose the response content is in binary form; then we can use the content attribute, for example to reconstruct a picture from the binary data. Besides, we also introduced the more common text attribute, which automatically guesses the text encoding and decodes it. Of course, we can also use the encoding attribute to change the text encoding; the most frequently used one is utf-8. Web page content acquired in this way, once parsed in the way we'll introduce soon, yields the real data inside. For example, if we take a part of it, the effect looks like this. When introducing dictionaries later, we'll talk about how to use the arguments of the get() method in the Requests library to submit queries at the keyword query interface of search engines. So far we have got a little skill at scraping web pages. For more complex requests, like submitting forms, we still need the post() method.
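A sketch of this scraping step, under some assumptions: the URL (the Douban subject id) and the User-Agent string are only illustrative, and the exact headers the site requires may differ:

```python
import requests

# Some sites reject requests without a browser-like User-Agent,
# so we pass a headers argument (this value is only an example).
headers = {"User-Agent": "Mozilla/5.0"}

# Illustrative URL for the short comments page of The Little Prince.
url = "https://book.douban.com/subject/1084336/comments/"
r = requests.get(url, headers=headers)

print(r.status_code)   # 200 means OK
r.encoding = "utf-8"   # override the guessed encoding if needed
html = r.text          # decoded page source

# For an endpoint that returns JSON, the built-in decoder can be used instead:
#     data = r.json()
# For binary responses (e.g. an image), use r.content.
```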
After acquiring the source code of a web page, we need to parse its data content, such as the short comments on The Little Prince that we introduced just now. All the comments are marked with this tag group. Such data parsing is quite suitable for a library like Beautiful Soup. Beautiful Soup is an HTML and XML parser; it can very conveniently extract data from this kind of file. Now suppose another situation. Let's see: this is the score of a comment. As we can see, this data structure is relatively complex. To extract this kind of data with complex details, it is more suitable to adopt the regular expression module.

Let's look at Beautiful Soup first. This is the official website of Beautiful Soup. You might have found that it is similar to the Requests library: its official website provides abundant content, as well as some examples for beginners, in multiple languages. Let's look at it briefly. To use Beautiful Soup, import the library first; next, use the BeautifulSoup() function to create a BeautifulSoup object. Here you may pass in a string or a file handle. Moreover, pay attention to selecting the parser. A lot of parsers are available, and the official website compares them in detail. For HTML, the most frequently used parser is lxml. This parser is fast and has strong document fault tolerance, so we often recommend it.

Let's have some examples, starting with the basic use of Beautiful Soup. Since Anaconda is preloaded with the Beautiful Soup library, we only need to import it directly; use the upper case, please. Then we define a string, pass it into the BeautifulSoup() function, and generate a BeautifulSoup object, "soup". It's worth noticing that there are four kinds of Beautiful Soup objects: Tag, NavigableString, BeautifulSoup, and Comment. A Tag is just a tag in an HTML or XML document, like this one, providing some decoration for text content. As for BeautifulSoup, in most cases it behaves like a Tag, so we can treat it as a Tag object most of the time. NavigableString is the string inside a Tag, like "The Little Prince" here, while Comment is a subclass of NavigableString.

OK, let's go on. Any tag can be acquired through the BeautifulSoup object followed by the tag name. For example, we can acquire the contents of tag b; check its type, and it's a Tag. The most essential attributes of a Tag object are its name and its attributes. Each tag exposes its name through the name attribute; say, here its name is "p". A tag may carry many attributes, and we can acquire them through their attribute names. Additionally, operations on tag attributes work the same as on the dictionary we'll talk about later, so we can acquire attributes in such a way. Also, let's have a look: a NavigableString object can be obtained with the string attribute. For instance, we get the non-attribute string inside the tag; check its type, and it is NavigableString. It will turn out to be very common in our later sections. In Beautiful Soup, there is another common method: find_all(). For instance, if we write like this, we can find all the b tags. If we only need the first matching tag, we can use the find() method. Of course, apart from taking a tag name as an argument, the find_all() method can also be given an attribute name.
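A minimal sketch of this basic usage; the HTML string is made up for illustration, and "lxml" assumes the lxml parser is installed (html.parser can be substituted otherwise):

```python
from bs4 import BeautifulSoup

# A small made-up HTML fragment to demonstrate the object types.
html = '<p class="title"><b>The Little Prince</b></p><b>Another bold text</b>'
soup = BeautifulSoup(html, "lxml")

tag = soup.b                # the first <b> tag, a Tag object
print(type(tag))            # <class 'bs4.element.Tag'>
print(soup.p.name)          # 'p'
print(soup.p["class"])      # attributes work like a dictionary: ['title']
print(tag.string)           # 'The Little Prince'
print(type(tag.string))     # <class 'bs4.element.NavigableString'>

print(soup.find_all("b"))             # all <b> tags, as a list
print(soup.find("b"))                 # only the first <b> tag
print(soup.find_all(class_="title"))  # search by attribute instead of tag name
```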
Next, let's have a look at the task of parsing the book reviews of The Little Prince that we introduced before, and see how we can use Beautiful Soup to parse the contents. Let's look at the program. Use the object we got before as an argument and pass it into the BeautifulSoup() function; we get a BeautifulSoup object, "soup". Then use the find_all() method to find the review lines. As we saw before, the characteristic of a review line is the tag span with the class attribute "short". The find_all() method returns a list, and we traverse the list; we'll talk about this pattern soon, so just have a brief look. For each item in the list, we only need to output the string attribute of the object, and we get the comment string. Execute this program, and we have acquired those short comments.

After acquiring the comments, let's look at how to get the scores. As we mentioned before, this kind of detail extraction is more appropriately handled with regular expressions. What is a regular expression? It is often used to retrieve and replace text that complies with certain rules or patterns. For example, [0-9] means any single digit, . means any character except the newline character, * means repetition zero or more times, and adding parentheses creates a group. Regular expressions are quite complex, so we only introduce them briefly here, with just one example. In web page parsing, one form is used very, very frequently: suppose we want to find a string with a known string in front of it, a known string behind it, and the target string in the middle. Then, very simply, we can express the target as (.*?), and use the compile() method to compile the pattern string into a pattern instance. Matching against a compiled pattern instance is fast, so we often use it. Then we use the most frequently used findall() function in the regular expression module to match this pattern in the source code. The result is a list, p. Then we calculate the sum of the scores: each score is originally a string, so we need to convert it into the int type first and then output the sum. OK, let's run it. Finally, we see the sum is 680. We might have a look at the value of p: it is a list, and the scores are all there.

This example has realized the scraping of a single web page and the parsing of some of its contents. We can also repeatedly scrape many pages based on the characteristics of their URLs. In such a way, a small crawler is created. So far, we have mastered the basics of developing a small web crawler. Let's look back: use the Requests library to scrape web pages, and use the Beautiful Soup library and the regular expression module to parse web page contents. You can practice more in the future and improve your skills; of course, you can try it on some other websites, not just Douban. When scraping web pages, do follow the crawler protocol.
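Putting the two parsing steps together, a sketch might look like the following. The class name "short", the "allstar... rating" fragment matched by the regular expression, and the URL are assumptions based on the page structure described above and may change if the site updates its layout:

```python
import re
import requests
from bs4 import BeautifulSoup

# Illustrative URL and headers; check robots.txt and the crawl delay first.
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get("https://book.douban.com/subject/1084336/comments/", headers=headers)

# Parse the short comments: each one is assumed to sit in a <span class="short"> tag.
soup = BeautifulSoup(r.text, "lxml")
for item in soup.find_all("span", class_="short"):
    print(item.string)

# Parse the scores with a regular expression.
# Assumed page fragment: <span class="user-stars allstar40 rating" ...>,
# where the digits between "allstar" and " rating" encode the score.
pattern = re.compile("allstar(.*?) rating")
p = pattern.findall(r.text)

# Each matched score is a string, so convert to int before summing.
total = sum(int(score) for score in p)
print(total)
```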