BetterDocs

Creation | pd.read_html()

Method:

pd.read_html(io, *, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, extract_links=None, dtype_backend=<no_default>, storage_options=None)

Reads HTML tables into a list of DataFrames.

Returns:

list of pandas.core.frame.DataFrame

Parameters:

io: (str or path or file-like)-

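URL, file path, file-like object, or raw HTML string to read. Passing a literal HTML string directly is deprecated since pandas 2.1; wrap it in io.StringIO instead.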

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read all tables from the URL
dfs = pd.read_html(io=url)

# Print the first DataFrame (the first table in the HTML)
print(dfs[0])
'''
Output: 
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''
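
When the HTML is already in memory, it can be wrapped in io.StringIO and passed as io. The sketch below is only an illustration: the inline table is made-up data standing in for a downloaded page, and it assumes an HTML parser (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline HTML standing in for a downloaded page
html = """
<table>
  <thead><tr><th>Name</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td>Alice</td><td>25</td></tr>
    <tr><td>Bob</td><td>30</td></tr>
  </tbody>
</table>
"""

# Wrap the markup in StringIO so pandas treats it as a file-like object
dfs = pd.read_html(StringIO(html))

# read_html returns a list of DataFrames, even when only one table is found
first = dfs[0]
print(type(dfs).__name__, len(dfs))
print(first)
```

Indexing the returned list (dfs[0]) is what yields an individual DataFrame.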

match: (str or regex), Optional-

String or compiled regex. Only tables containing text that matches it are returned. The default, '.+', matches any non-empty table.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables, filtering tables containing "Sales"
dfs = pd.read_html(io=url, match="Sales")

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
   Year  Sales  Revenue
0  2021    500     1000
1  2022    600     1200
'''

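flavor: ('lxml' or 'bs4' or 'html5lib', or a list of these), Optional-

Parsing engine (or list of engines) to use. With the default, None, pandas tries 'lxml' first and falls back to 'bs4' with 'html5lib' if that fails.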

import pandas as pd

# URL or HTML content
url = 'https://betterdocs.tech/global/python/pandas/tables.html'

# Using BeautifulSoup with html5lib (the fallback when lxml fails)
dfs_bs4 = pd.read_html(url, flavor='bs4')

# Using lxml
dfs_lxml = pd.read_html(url, flavor='lxml')

# Using html5lib
dfs_html5lib = pd.read_html(url, flavor='html5lib')

# Output the first DataFrame from each flavor
print(dfs_bs4[0])  # First DataFrame from BeautifulSoup
print(dfs_lxml[0])  # First DataFrame from lxml
print(dfs_html5lib[0])  # First DataFrame from html5lib
'''
Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''

"lxml" is best for performance, particularly for large files, but it is strict and may fail on malformed markup.

"bs4" and "html5lib" are synonymous in pandas: both select BeautifulSoup4 with the html5lib parser, which is slower but more tolerant of poorly formatted or complex HTML.

index_col: (int or list-like), Optional-

Column (or list of columns) to use as the index of each resulting DataFrame. The default, None, keeps the default integer index.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, match="Expenditure", index_col="Year")

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
      Expenditure  Profit
Year                     
2021          200     800
2022          300     900
'''

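skiprows: (int, list-like or slice), Optional-

Rows to skip. An integer skips that many rows from the top of the table; a list or slice skips the rows at those 0-based positions.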

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, skiprows=1)

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
     Alice  25
0      Bob  30
1  Charlie  22
'''

attrs: dict, Optional-

Dictionary of HTML attributes used to identify the tables to parse, for example {'id': 'finance-table'}. Only tables whose attributes match are returned.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, attrs={'id': 'finance-table'})

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
   Year  Expenditure  Profit
0  2021          200     800
1  2022          300     900
'''

parse_dates: (True or False), Optional-

It controls date parsing and follows the same rules as pd.read_csv(): True attempts to parse the index as dates, while a list of column names or positions parses those columns.

parse_dates = False (default)

parse_dates = True
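
As a rough illustration of parse_dates, the sketch below parses one column as dates by passing a list of column names. The inline table and the "Date" column are made-up data, and an installed HTML parser (lxml, or bs4 with html5lib) is assumed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with an ISO-formatted date column
html = """
<table>
  <thead><tr><th>Date</th><th>Sales</th></tr></thead>
  <tbody>
    <tr><td>2021-01-15</td><td>500</td></tr>
    <tr><td>2022-03-01</td><td>600</td></tr>
  </tbody>
</table>
"""

# Default behaviour: the Date column is left as text
raw = pd.read_html(StringIO(html))[0]

# Ask pandas to parse the Date column as datetimes
parsed = pd.read_html(StringIO(html), parse_dates=["Date"])[0]

print(raw["Date"].dtype)
print(parsed["Date"].dtype)
```

Once parsed, the column supports datetime operations such as parsed["Date"].dt.year.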

thousands: single char, Optional-

Character acting as the thousands separator in numerical values (default: ',').

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'numbers-class'}, thousands='|')

# Print the first matched DataFrame
print(dfs[0]['Thousands'])
'''
Output: 
0    123908213
1        78234
2         1023
Name: Thousands, dtype: int64
'''

encoding: str, Optional-

Encoding used to decode the web page, for example 'utf-8'.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, encoding='utf-8')

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''

decimal: single char, Optional-

Decimal separator character.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'numbers-class'}, decimal=',')

# Print the first matched DataFrame
print(dfs[0]['Decimal'])
'''
Output: 
0    12021
1       12
2      921
Name: Decimal, dtype: int64
'''

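converters: dict, Optional-

It allows you to specify custom functions to convert values in certain columns while reading the table. Keys are column labels or integer positions, and each value is a one-argument function that receives the content of a single cell.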

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

def half_the_age(age):
    return int(age)/2

# Read HTML tables
dfs = pd.read_html(io=url, converters={'Age': half_the_age})

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
      Name   Age
0    Alice  12.5
1      Bob  15.0
2  Charlie  11.0
'''

na_values: iterable, Optional-

It is used to specify additional values that should be treated as NaN (Not a Number) while reading the data.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, match="Position", na_values=["null"])

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
    Name   Age  Position
0  Alice   NaN   Manager
1    Bob  30.0  Engineer
'''

keep_default_na: (True or False), Optional-

It controls whether pandas' default NaN markers (such as 'NA', 'N/A' and 'NaN') are treated as missing values when reading the table.

keep_default_na = True (default)

keep_default_na = False

If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.

If keep_default_na is False, and na_values are specified, only the NaN values specified in na_values are used for parsing.

If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
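
The interaction between keep_default_na and na_values can be sketched as follows. The table is made-up data in which the text "NA" is a legitimate value (a country code) rather than a missing one, and an installed HTML parser (lxml, or bs4 with html5lib) is assumed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table where "NA" is real data, not a missing value
html = """
<table>
  <thead><tr><th>Country</th><th>Code</th></tr></thead>
  <tbody>
    <tr><td>Namibia</td><td>NA</td></tr>
    <tr><td>Norway</td><td>NO</td></tr>
  </tbody>
</table>
"""

# Default: "NA" is one of the default NaN markers, so it becomes NaN
with_defaults = pd.read_html(StringIO(html))[0]

# Defaults disabled and no na_values: "NA" is kept as a plain string
without_defaults = pd.read_html(StringIO(html), keep_default_na=False)[0]

# Defaults disabled with na_values: only the listed strings become NaN
custom_only = pd.read_html(
    StringIO(html), keep_default_na=False, na_values=["NO"]
)[0]

print(with_defaults)
print(without_defaults)
print(custom_only)
```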

displayed_only: (True or False), Optional-

It controls whether hidden content is parsed. When True, tables and elements styled with "display: none" are skipped, so only content actually shown on the page is processed into DataFrames.

displayed_only = True (default)

displayed_only = False
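
A minimal sketch of the difference between the two settings, using a made-up table in which one row is hidden with an inline "display: none" style. It assumes an HTML parser (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: the second row is hidden via an inline style
html = """
<table>
  <thead><tr><th>Name</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td>Alice</td><td>25</td></tr>
    <tr style="display: none"><td>Bob</td><td>30</td></tr>
  </tbody>
</table>
"""

# Default: hidden elements are dropped before the table is parsed
visible = pd.read_html(StringIO(html))[0]

# displayed_only=False keeps the hidden row as well
everything = pd.read_html(StringIO(html), displayed_only=False)[0]

print(visible)
print(everything)
```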

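dtype_backend: ('numpy_nullable' or 'pyarrow'), Optional-

New in pandas 2.0. It selects the backend for the data types of the resulting DataFrames: 'numpy_nullable' returns DataFrames backed by nullable dtypes, and 'pyarrow' returns DataFrames backed by pyarrow ArrowDtype. If it is not set, ordinary NumPy-backed dtypes are used.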
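
The effect of dtype_backend can be seen by comparing column dtypes. This sketch assumes pandas 2.0 or later and an installed HTML parser; the inline table is made-up data, and 'numpy_nullable' is used so that pyarrow is not required.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with one integer column
html = """
<table>
  <thead><tr><th>Name</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td>Alice</td><td>25</td></tr>
    <tr><td>Bob</td><td>30</td></tr>
  </tbody>
</table>
"""

# Default backend: plain NumPy dtypes
default_df = pd.read_html(StringIO(html))[0]

# Nullable backend: integer columns use the nullable Int64 extension dtype
nullable_df = pd.read_html(StringIO(html), dtype_backend="numpy_nullable")[0]

print(default_df.dtypes)
print(nullable_df.dtypes)
```

Nullable dtypes can hold missing values without converting an integer column to float.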

storage_options: dict, Optional-

Dictionary of storage-specific options, such as credentials for cloud storage.

