URL, file path, or HTML string to read.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read all tables from the URL
dfs = pd.read_html(io=url)
# Print the first DataFrame (the first table in the HTML)
print(dfs[0])
'''
Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''
Regex pattern to match tables (default: '.+').
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables, filtering tables containing "Sales"
dfs = pd.read_html(io=url, match="Sales")
# Print the first matched DataFrame
print(dfs[0])
'''
Output:
   Year  Sales  Revenue
0  2021    500     1000
1  2022    600     1200
'''
Parsing engine.
import pandas as pd
# URL or HTML content
url = 'https://betterdocs.tech/global/python/pandas/tables.html'
# Using BeautifulSoup 4 (the fallback engine when lxml is unavailable)
dfs_bs4 = pd.read_html(url, flavor='bs4')
# Using lxml
dfs_lxml = pd.read_html(url, flavor='lxml')
# Using html5lib
dfs_html5lib = pd.read_html(url, flavor='html5lib')
# Output the first DataFrame from each flavor
print(dfs_bs4[0]) # First DataFrame from BeautifulSoup
print(dfs_lxml[0]) # First DataFrame from lxml
print(dfs_html5lib[0]) # First DataFrame from html5lib
'''
Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''
"lxml" is best for performance, particularly for large files.
"bs4" is useful for handling poorly formatted or complex HTML.
"html5lib" is used when dealing with HTML5 documents and when compatibility with modern web standards is a priority.
Row number(s) to use as the column names.
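A minimal sketch, assuming the sample page above exposes a table whose first row holds the column names:
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Use the first row of the table as the column names
dfs = pd.read_html(io=url, header=0)
# Print the first DataFrame
print(dfs[0])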
It specifies which column should be used as the index of the resulting DataFrame.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables, using the first column ("Year") as the index
dfs = pd.read_html(io=url, match="Expenditure", index_col=0)
# Print the first matched DataFrame
print(dfs[0])
'''
Output:
      Expenditure  Profit
Year
2021          200     800
2022          300     900
'''
It is used to filter HTML tags based on their attributes when reading HTML tables from a file, URL, or string.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables
dfs = pd.read_html(io=url, attrs={'id': 'finance-table'})
# Print the first matched DataFrame
print(dfs[0])
'''
Output:
   Year  Expenditure  Profit
0  2021          200     800
1  2022          300     900
'''
It is used to specify which columns should be parsed as dates during the reading of the table.
With parse_dates=False (the default), pandas will not attempt to parse any columns as dates; any date strings present are read as plain text.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'dates-class'}, parse_dates=False)
# Print the Date column of the first matched DataFrame
print(dfs[0]['Date'])
'''
Output:
0    2003-10-02
1    2003-04-16
2    2023-12-29
Name: Date, dtype: object
'''
When parse_dates is given a list of column labels (e.g., parse_dates=['Date']), pandas will attempt to parse the values of those columns (e.g., strings in the format "YYYY-MM-DD") into datetime objects.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'dates-class'}, parse_dates=['Date'])
# Print the Date column of the first matched DataFrame
print(dfs[0]['Date'])
'''
Output:
0    2003-10-02
1    2003-04-16
2    2023-12-29
Name: Date, dtype: datetime64[ns]
'''
Character acting as the thousands separator in numerical values.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'numbers-class'}, thousands='|')
# Print the Thousands column of the first matched DataFrame
print(dfs[0]['Thousands'])
'''
Output:
0    123908213
1        78234
2         1023
Name: Thousands, dtype: int64
'''
Decimal separator character.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'numbers-class'}, decimal=',')
# Print the Decimal column of the first matched DataFrame
print(dfs[0]['Decimal'])
'''
Output:
0    12021
1       12
2      921
Name: Decimal, dtype: int64
'''
It allows you to specify custom functions to convert or process values in certain columns while reading the file.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Converter function: halve every value in the Age column
def half_the_age(age):
    return int(age) / 2
# Read HTML tables
dfs = pd.read_html(io=url, converters={'Age': half_the_age})
# Print the first matched DataFrame
print(dfs[0])
'''
Output:
      Name   Age
0    Alice  12.5
1      Bob  15.0
2  Charlie  11.0
'''
It is used to specify additional values that should be treated as NaN (Not a Number) while reading the data.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables
dfs = pd.read_html(io=url, match="Position", na_values=["null"])
# Print the first matched DataFrame
print(dfs[0])
'''
Output:
    Name   Age  Position
0  Alice   NaN   Manager
1    Bob  30.0  Engineer
'''
" ", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null"
It controls whether the default set of missing-value strings recognized by pandas should be included when parsing the data.
With keep_default_na=True (the default), pandas keeps the default missing-value strings (like NaN, NA, null, etc.) and converts them to NaN.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables
dfs = pd.read_html(io=url, match="Position", keep_default_na=True)
# Print the first matched DataFrame
print(dfs[0])
'''
Output:
    Name   Age  Position
0  Alice   NaN   Manager
1    Bob  30.0  Engineer
'''
With keep_default_na=False, pandas does not interpret the default missing values (e.g., NA, null, NaN, etc.) as missing values; it only treats the values specified in the na_values parameter as NaN.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read HTML tables
dfs = pd.read_html(io=url, match="Position", keep_default_na=False)
# Print the first matched DataFrame
print(dfs[0])
'''
Output:
    Name   Age  Position
0  Alice  null   Manager
1    Bob    30  Engineer
'''
" ", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null"
If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
If keep_default_na is False, and na_values are specified, only the NaN values specified in na_values are used for parsing, as shown in the sketch below.
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
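A minimal sketch combining the two parameters on the same sample table used above: with keep_default_na=False and na_values=["null"], only the string "null" is converted to NaN.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Ignore the built-in NaN strings and treat only "null" as missing
dfs = pd.read_html(io=url, match="Position", keep_default_na=False, na_values=["null"])
# Print the first matched DataFrame
print(dfs[0])
Because "null" is the only string mapped to NaN here, the result matches the keep_default_na=True example above.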
It allows you to filter out hidden tables from being parsed, ensuring that only the tables actually shown on the page are processed into DataFrames.
With displayed_only=True (the default), only visible tables (those not hidden using CSS) are read.
With displayed_only=False, all tables, including those hidden via CSS, will be read.
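A minimal sketch, assuming the sample page above also contains tables hidden with CSS:
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Read every table, including those hidden via CSS (e.g. display: none)
dfs_all = pd.read_html(io=url, displayed_only=False)
# Read only the tables actually rendered on the page (the default)
dfs_visible = pd.read_html(io=url, displayed_only=True)
# Compare how many tables each call returns
print(len(dfs_all), len(dfs_visible))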
" ", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null"
If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
It controls whether hyperlinks (href attributes) are extracted from the HTML tables and included, alongside the cell text, in the resulting DataFrame.
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables-links.html"
# Read HTML tables
dfs = pd.read_html(io=url, extract_links="all")
# Print the first matched DataFrame
print(dfs[0])
'''
Output:
(Name, None) (Link, None)
0 (Alice, None) (Alice's Profile, https://dikshapadi.site)
1 (Bob, None) (Bob's Profile, https://aarya.fun)
'''
extract_links=None: No links are extracted (the default).
extract_links='header': Extracts links only from the table header.
extract_links='footer': Extracts links only from the table footer.
extract_links='body': Extracts links only from the body of the table.
extract_links='all': Extracts links from all parts of the table.
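A minimal sketch contrasting two of these options on the links page used above: with extract_links='body' only body cells become (text, href) tuples, while the column labels stay plain strings.
import pandas as pd
# URL containing HTML tables with links
url = "https://betterdocs.tech/global/python/pandas/tables-links.html"
# Extract links from body cells only; header labels remain plain strings
dfs_body = pd.read_html(io=url, extract_links="body")
# Discard links entirely (the default)
dfs_none = pd.read_html(io=url, extract_links=None)
# Compare the column labels produced by each call
print(dfs_body[0].columns.tolist())
print(dfs_none[0].columns.tolist())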
The dtype_backend parameter, added in pandas 2.0, specifies which dtype backend to use for the resulting DataFrames: "numpy_nullable" for pandas' nullable extension dtypes or "pyarrow" for Arrow-backed dtypes (NumPy-backed dtypes are used by default).
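A minimal sketch, assuming pandas 2.0+ (and pyarrow installed if you choose the "pyarrow" backend):
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Return columns with nullable extension dtypes instead of plain NumPy dtypes
dfs = pd.read_html(io=url, dtype_backend="numpy_nullable")
# Inspect the resulting dtypes (e.g. Int64 instead of int64)
print(dfs[0].dtypes)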
Dictionary of storage-specific options, such as credentials for cloud storage.
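A minimal sketch, assuming pandas 2.1+ (where read_html accepts storage_options); for HTTP(S) URLs the key-value pairs are forwarded as request headers, so a custom User-Agent can be sent, for example:
import pandas as pd
# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"
# Forward a custom header with the HTTP request
dfs = pd.read_html(io=url, storage_options={"User-Agent": "pandas-read-html-example"})
# Print the first DataFrame
print(dfs[0])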