Book Analysis (Web Scraping + Data Analysis) + TA murmur
This is the second set of teaching materials I have made as a TA for NTU's "Introduction to Python Programming for Artificial Intelligence". The first one was the most basic introductory part (week 2): just hello world, if/else, loops, and other fundamentals, with only the simplest exercises. I just finished the week 9 materials (well, half of them; the second half still has to cover some more string processing). The main goal is to pull together everything learned in the previous six classes (no progress was made in week 1, and week 5 was a holiday), including web scraping and data analysis. Here we go!
Case study: Best Books Ever
In this study, we will get a list of the most popular books (or other kinds of books, if you want) from a demonstration website.
Step 1: Go to the website here:
https://www.goodreads.com/list/:
On this website, you can see plenty of book lists. You can choose any list you like. For example, I chose the "Best Books Ever" list. You can go and take a close look:
https://www.goodreads.com/list/show/1.Best_Books_Ever
Step 2: Use the package "requests" to visit the website and get all of the content of the page.
Hint: I assume you are using Google Chrome; the steps below are harder to carry out in Safari.
If you are curious about what "requests" actually gets, you can go to any website you like (for example, the book list website given above) and press F12, or right-click -> Inspect (developer mode, the last button in the menu) -> Elements.
You can see the HTML in the window you just opened. In most browsers, this window contains several tools that let you:
1. inspect the code and identify the elements and their styles (the website is rendered from all of the code here),
2. see loaded resources, their file sizes, where they were loaded from, and how long they took to load,
3. check performance, including memory consumption and rendering.
This will help you understand what is going on in the code below. For example, you can check where a sentence or a picture lives in the code by selecting it and pressing F12:
You can even change the content of the website (this kind of modification is only visible on your own computer, and it vanishes as soon as you refresh or close the page. Just for fun!)
If you want to learn more about this kind of code, kindly visit these websites:
(English) https://www.w3schools.com/html/html_intro.asp
(Chinese) https://selflearningsuccess.com/html-tags/
Let's get down to business. You can get the content based on the code from last week. Try it!
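If you get stuck, here is a minimal sketch of what the fetching step can look like, assuming the list URL above and a plain GET request (the variable names and the User-Agent header are just my own choices, not the only way to do it):
import requests

# The list page we want to scrape (the "Best Books Ever" list mentioned above)
url = 'https://www.goodreads.com/list/show/1.Best_Books_Ever'

# Send a GET request; some sites expect a browser-like User-Agent header
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

print(response.status_code)   # 200 means the request succeeded
print(response.text[:500])    # peek at the first 500 characters of the HTML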
Step 3: Use the "BeautifulSoup" package to parse the content:
After obtaining the web page source code through requests, Beautiful Soup needs a second "parser" argument to convert the plain-text source code into a "tag tree" that can be analyzed. Python has the "html.parser" parser built in; you can also install and use the "html5lib" parser (not covered in this course). Let's use html.parser to turn the plain-text code into something analyzable.
soup = BeautifulSoup(response.content, 'html.parser')
book_list = ____ ( ______, itemtype=_______)
Please fill in the blanks.
In this example, we want to find all <___> tags whose itemtype attribute equals http://schema.org/Book. This means that these <___> elements contain the information about each book. You should choose the correct tag name to put inside the <>.
You can choose from the tags in the table below:
<p>: paragraph
<a>: anchor (hyperlink)
<tr>: table row
<ol>: ordered list
<th>: table header
<ul>: unordered list
<h1> ~ <h6>: header (large to small)
<div>: division
<img>: image
<span>: span (an inline container)
You can go to the developer mode mentioned above to figure out which one is correct, too 😊
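To see how this kind of attribute filtering works without giving away the answer, here is a small sketch on a made-up HTML snippet (the <div> tag and the Movie type below are invented for illustration; they are not the answer to the blank):
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet, just to show how attribute filters work
html = '<div itemtype="http://schema.org/Movie"><span itemprop="name">Inception</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all(tag_name, attribute=value) returns every matching tag as a list
movies = soup.find_all('div', itemtype='http://schema.org/Movie')
print(len(movies))                                   # 1
print(movies[0].find('span', itemprop='name').text)  # Inception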
Right now, we have all of the information we need to analyze data!
Step 4: Analyze some data:
We use pandas to do the analysis. First, we create an empty data table.
Then, we can use "find" to locate the element we want for each book.
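If the "empty data table" idea is unclear, here is a toy sketch (with made-up values, not scraped ones) of how a dict of empty lists can be filled in a loop and then turned into a DataFrame:
import pandas as pd

# Start with empty lists, one per column we plan to collect
data = {'Title': [], 'Score': []}

# Pretend these values came from the parsed web page
for title, score in [('Book A', 4.5), ('Book B', 3.9)]:
    data['Title'].append(title)
    data['Score'].append(score)

df = pd.DataFrame(data)
print(df)   # a 2-row table with columns Title and Score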
Furthermore, let me introduce THREE common functions for processing strings:
1. strip(): "strip" helps you remove unwanted characters at the beginning or at the end of a string. For example:
string1 = "!?!?!?!I love you?!?!?!"
string2 = string1.strip("?!")
print(string2) -> "I love you"
2. split(): "split" helps you turn a sentence into a list, using some delimiter. For example:
string3 = "I love coding very much"
string4 = string3.split(" ")
print(string4) -> ["I", "love", "coding", "very", "much"]
3. replace(): "replace" helps you replace part of a string with something else. For example:
string5 = "I,love,coding,very,much"
string6 = string5.replace(",", " ")
print(string6) -> "I love coding very much"
You need to use these three little functions wisely to achieve the desired result.
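For instance, a made-up raw string such as '  1,234,567 people voted  ' (not the exact text you will see on the page) can be cleaned step by step with these three functions before converting it to a number:
raw = '  1,234,567 people voted  '

cleaned = raw.strip()                        # '1,234,567 people voted'
number_text = cleaned.split(' ')[0]          # '1,234,567'
number_text = number_text.replace(',', '')   # '1234567'

print(int(number_text))                      # 1234567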
Let’s practice!
data = {'Title': [], _____, _____, _____}
for ___ in ____:
    title = ___.____('span', itemprop='name').text.____
    author = ___.____(___, itemprop=___).text.____
    original_score = ___.____(___, class_=___).___
    viewer = ______
    # You may encounter some problems here. Please try to debug them by yourself!
    # The numbers here originally come in as strings; you will want to convert them into the desired data types.
    score_float = float(___)
    viewer_int = int(___)  # Hint: we can do something to the string before converting its type.
    data['Title'].________
    data['Author'].________
    data['Score'].________
    data['viewer'].________
df = pd.DataFrame(data)
print(df)
Step 5: Do some data analysis!
A. Please show me the average score, median score, max score, and min score of the data. (HINT: you can show all of the above in one line. Remember how?)
B. Please find the top 5 books with the best scores.
C. Please find the top 10 authors with the most viewers. (If an author has published more than one book, you should add the viewers of each book together.) You may go and look up the functions "groupby" and "agg" by yourself. Here is the example code:
author_stats = df.groupby('___').agg({'Score': '___', 'viewer': '___'}).reset_index()
top_rated_authors = author_stats.nlargest(10, 'Score')
top_popular_authors = author_stats.nlargest(___, ___)
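If you want to see how groupby and agg behave before touching the real data, here is a tiny sketch on a made-up table (the column names follow the ones above, but the numbers are invented):
import pandas as pd

toy = pd.DataFrame({
    'Author': ['A', 'A', 'B'],
    'Score':  [4.0, 5.0, 3.5],
    'viewer': [100, 200, 50],
})

# Average score and total viewers per author
stats = toy.groupby('Author').agg({'Score': 'mean', 'viewer': 'sum'}).reset_index()
print(stats)
print(stats.nlargest(1, 'viewer'))   # the single most-viewed author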
D. How many authors in this list published more than one book?
E. Please calculate the correlation between "Score" and "viewer", and try to describe the relationship between these two features. Sample code:
df.____(method='pearson', min_periods=1)
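As a minimal sketch of what the correlation call returns (again with made-up numbers), you can try something like this and compare it with what you get from the real data:
import pandas as pd

toy = pd.DataFrame({'Score': [4.0, 5.0, 3.5], 'viewer': [100, 200, 50]})

# Pearson correlation between every pair of numeric columns
print(toy.corr(method='pearson', min_periods=1))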
The answer to B looks like this:
The answer to C looks like this:
The answer to D looks like this:
Final Step: Draw some graphs!
Please make the following graphs:
- Top 10 authors with the most viewers (bar chart)
- The relationship between rating and viewers (scatter plot)
- Find the author who published the most books in your list, and show the score of each of their books (line plot)
- The result should look like the pictures below:
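Besides the reference pictures, here is a minimal plotting sketch with made-up data, assuming matplotlib is the plotting library used in class (adjust if your class used something else), and feeding in the real columns you computed above instead:
import matplotlib.pyplot as plt

# Made-up numbers, just to show the three chart types
authors = ['A', 'B', 'C']
viewers = [300, 250, 180]
scores = [4.2, 3.8, 4.5]

plt.bar(authors, viewers)      # bar chart: viewers per author
plt.show()

plt.scatter(scores, viewers)   # scatter plot: rating vs. viewers
plt.show()

plt.plot(scores, marker='o')   # line plot: scores of one author's books
plt.show()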
Honestly, this TA gig is pretty cushy. Because the professor is so lenient, the course grade is just 50% in-class performance and 50% final project, and "in-class performance" only means practicing some sample code during class. There is no homework and no exams, so the time spent reviewing at home is zero. On top of that, she opened four courses of a similar nature, and by regulation each course can have two TAs, so she hired eight of us. Since the four courses teach the same things, 7/8 of the time I just wait for the materials the other TAs produce and then teach them right away; I only have to prepare teaching materials three times a semester. Overall, this is probably the cushiest TA job in all of NTU: no assigning homework, no grading homework or exams, and no real responsibility, since everyone ends up with an A+ anyway. The main job is just handling Python questions and course questions during and after class. (Even though this course has a pile of course questions, and the professor knew nothing at all in the first class @@) That said, I honestly feel nobody learns much in this class. I hope the professor can restructure it to be more like the other Python general-education courses, because the things that should be learned still need to be learned. Professor, your teaching is really way too watered down, and people can leave this class more than an hour early...
Lastly, since this course is taught in English, I have unexpectedly gained quite a lot of speaking practice. I still get stuck now and then, but at least I am becoming more willing to speak, and I am even going to get up on stage to teach linear algebra (?)