Beautiful Soup [Python]: extracting text from a table

I am new to Python and to Beautiful Soup as well! I have heard that Beautiful Soup is a great tool for parsing and extracting content. So here I am:

I want to get the content of the first td of a table in an HTML document. For example, I have this table:

<table>
  <tr>
    <td> This is a sample text </td>
    <td> This is the second sample text </td>
  </tr>
</table>

How can I use Beautiful Soup to get the text "This is a sample text"? I use soup.findAll('table', attrs={'class':'bp_ergebnis_tab_info'}) to get the whole table.

Thanks... Or should I try to do the whole thing in Perl, which I am not so familiar with? Another solution would be a regex in PHP.

See the target [1]:

Note: since the HTML is slightly invalid, I think we will have to do some cleaning first. That could require a lot of PHP code if we solve this in PHP. Perl would be a good solution too.

Many thanks for any hints and ideas for a starting point.

3 Answers

First find the table (as you are already doing). Using find rather than findAll returns the first match directly, rather than a list of all matches, in which case we would need an extra [0] to take the first element:

table = soup.find('table' ,attrs={'class':'bp_ergebnis_tab_info'})

Then use find again to find the first td:

first_td = table.find('td')

Then use renderContents() to get the contents of the cell as a string (for a cell that holds only text, this is the text itself):

text = first_td.renderContents()

... and the job is done, though you may also want to use strip() to remove leading and trailing spaces:

trimmed_text = text.strip()

This should give:

>>> print trimmed_text
This is a sample text
>>>

as desired.
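
For readers on the current bs4 package, the same steps can be sketched as follows. This is a minimal sketch, assuming Python 3 with bs4 installed. The sample HTML is the table from the question with a class attribute added (an assumption, so that the class-based lookup has something to match), and get_text(strip=True) stands in for renderContents() followed by strip().

```python
from bs4 import BeautifulSoup

# Sample markup from the question; the class attribute is added here
# as an assumption so that the class-based lookup has something to match.
html = """
<table class="bp_ergebnis_tab_info">
  <tr>
    <td> This is a sample text </td>
    <td> This is the second sample text </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
table = soup.find("table", attrs={"class": "bp_ergebnis_tab_info"})
first_td = table.find("td") if table is not None else None

# get_text(strip=True) returns the cell text without surrounding whitespace.
trimmed_text = first_td.get_text(strip=True) if first_td is not None else None
print(trimmed_text)
```

The None checks are there because find() does not raise when nothing matches; on a real page it is worth confirming that the table was found before searching inside it.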

Use the "text" attribute to get the text inside each "td".

1) First locate the table in the DOM by tag or ID

soup = BeautifulSoup(self.driver.page_source, "html.parser")
htnm_migration_table = soup.find("table", {'id':'htnm_migration_table'})

2) Read tbody

tbody = htnm_migration_table.find('tbody')

3) Read all tr from tbody tag

trs = tbody.find_all('tr')

4) Get all td elements from each tr and print their text

for tr in trs:
    tds = tr.find_all('td')
    for td in tds:
        print(td.text)
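
The four steps above can be put together as one self-contained sketch. The HTML string here is a hypothetical stand-in for the Selenium page source, and collecting the cell texts into a list of rows (instead of printing each cell) is an illustrative choice, not part of the original answer. One assumption worth flagging: html.parser does not insert a tbody element when the markup lacks one, so the sketch falls back to the table itself.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the id matches the steps above.
html = """
<table id="htnm_migration_table">
  <tbody>
    <tr><td>a1</td><td>a2</td></tr>
    <tr><td>b1</td><td>b2</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "htnm_migration_table"})

# html.parser does not add a missing <tbody>, so fall back to the table itself.
body = table.find("tbody") or table

rows = []
for tr in body.find_all("tr"):
    # One list of stripped cell texts per row.
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```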

I find Beautiful Soup a very efficient tool, so keep learning it :-) It can parse pages with invalid markup, so it should be able to handle the page you refer to. You can call BeautifulSoup(html).prettify() if you want a reformatted page source with repaired markup.
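
As a rough illustration of that repair behaviour, here is a minimal sketch assuming bs4 with the built-in html.parser. The broken markup is invented for the example. How exactly a damaged tree is repaired differs between parsers (html.parser, lxml, html5lib), so the output should be treated as one possible result.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: the td and tr tags are never closed.
broken_html = "<table><tr><td>Unclosed cell</table>"

soup = BeautifulSoup(broken_html, "html.parser")

# prettify() serializes the parsed tree, so every opened tag comes out closed.
pretty = soup.prettify()
print(pretty)
```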

As for your question: soup.findAll(...) returns a list of matching tags, not a single tag, so you cannot call find on its result directly. Use soup.find(...) instead, which returns the first match. That tag is itself searchable, so you can make a second search in it, like this:

table_soup = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})
your_sample_text = table_soup.find("td").renderContents().strip()
print(your_sample_text)
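
The difference between the list returned by findAll and the single tag returned by find can be checked with a short sketch. It assumes bs4 on Python 3, where findAll is spelled find_all, and it reuses the table from the question with the class attribute added as an assumption.

```python
from bs4 import BeautifulSoup

html = """
<table class="bp_ergebnis_tab_info">
  <tr><td> This is a sample text </td><td> This is the second sample text </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
attrs = {"class": "bp_ergebnis_tab_info"}

# find_all() (findAll in older code) returns a list-like ResultSet of tags.
tables = soup.find_all("table", attrs=attrs)

# Index into the result to get a single tag before searching inside it.
first_via_index = tables[0].find("td").get_text(strip=True)

# find() returns that first tag directly.
first_via_find = soup.find("table", attrs=attrs).find("td").get_text(strip=True)

print(len(tables), first_via_index == first_via_find)
```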
