BetterDocs

Creation | pd.read_html()

Method:

pd.read_html(io, *, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, extract_links=None, dtype_backend=<no_default>, storage_options=None)

Reads HTML tables into a list of DataFrames.

Returns:

list of pandas.core.frame.DataFrame

Parameters:

io: (str or path or file-like)-

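URL, file path, or file-like object to read. Passing a literal HTML string directly is deprecated as of pandas 2.1; wrap it in io.StringIO instead.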

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read all tables from the URL
dfs = pd.read_html(io=url)

# Print the first DataFrame (the first table in the HTML)
print(dfs[0])
'''
Output: 
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''
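When the markup is already in memory rather than behind a URL, a file-like wrapper can be passed as io. The sketch below assumes a small made-up table (the html string is illustrative and not taken from the page above) and that a parser such as lxml, or bs4 with html5lib, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline HTML standing in for a downloaded page.
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>30</td></tr>
</table>
"""

# Wrap the literal markup in StringIO so it is handled as a file-like object.
dfs = pd.read_html(StringIO(html))

print(type(dfs))  # a list, even when only one table is found
print(len(dfs))   # 1
print(dfs[0])
```

Because the result is always a list, the usual pattern is to index into it (dfs[0]) or loop over it.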

match: (str or compiled regex), Optional-

Regex pattern (default: '.+'). Only tables containing text that matches the pattern are returned.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables, filtering tables containing "Sales"
dfs = pd.read_html(io=url, match="Sales")

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
   Year  Sales  Revenue
0  2021    500     1000
1  2022    600     1200
'''
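To see the filtering effect of match in isolation, the following sketch uses a made-up page with two tables, only one of which contains the word "Sales". The html string is an assumption for illustration, and an installed parser (lxml, or bs4 with html5lib) is required.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second mentions "Sales".
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Sales</th></tr>
  <tr><td>2021</td><td>500</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))
sales_tables = pd.read_html(StringIO(html), match="Sales")

print(len(all_tables))    # 2
print(len(sales_tables))  # 1
print(sales_tables[0])
```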

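flavor: ('lxml' or 'bs4' or 'html5lib', or a list of these), Optional-

Parsing engine, or list of engines to try. With the default None, pandas tries lxml first and falls back to bs4 + html5lib if that fails.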

import pandas as pd

# URL or HTML content
url = 'https://betterdocs.tech/global/python/pandas/tables.html'

# Using BeautifulSoup with html5lib (the fallback when the default lxml attempt fails)
dfs_bs4 = pd.read_html(url, flavor='bs4')

# Using lxml
dfs_lxml = pd.read_html(url, flavor='lxml')

# Using html5lib
dfs_html5lib = pd.read_html(url, flavor='html5lib')

# Output the first DataFrame from each flavor
print(dfs_bs4[0])  # First DataFrame from BeautifulSoup
print(dfs_lxml[0])  # First DataFrame from lxml
print(dfs_html5lib[0])  # First DataFrame from html5lib
'''
Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''

"lxml" is the fastest option, which matters most for large files, but it makes no guarantees about its results unless the markup is strictly valid.

"bs4" and "html5lib" are synonymous in pandas: both select BeautifulSoup 4 with the html5lib parser, which is slower but more tolerant of poorly formed HTML.
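One way to make the flavor choice explicit is to check which parser is importable before calling read_html. This is only a sketch: the selection rule and the inline html string are assumptions for illustration, not something pandas requires.

```python
from importlib.util import find_spec
from io import StringIO

import pandas as pd

# Hypothetical one-row table used only to exercise the parser.
html = (
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Alice</td><td>25</td></tr></table>"
)

# Prefer lxml when it is importable; otherwise request the BeautifulSoup route.
flavor = "lxml" if find_spec("lxml") is not None else "bs4"

dfs = pd.read_html(StringIO(html), flavor=flavor)
print(flavor, dfs[0].shape)
```

Pinning the flavor this way makes results reproducible across machines that have different parsers installed.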

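header: (int or list-like), Optional-

Row number(s) to use as the column names. The default is None, in which case header rows are inferred from the table markup.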

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, header=0)

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''
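The header argument matters most when a table has no th cells, so pandas cannot infer the header row by itself. The sketch below assumes such a table (the html string is made up) and compares the default with header=0.

```python
from io import StringIO

import pandas as pd

# Hypothetical table whose first row holds the names but uses plain <td> cells.
html = """
<table>
  <tr><td>Name</td><td>Age</td></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>30</td></tr>
</table>
"""

no_header = pd.read_html(StringIO(html))[0]
with_header = pd.read_html(StringIO(html), header=0)[0]

print(list(no_header.columns))    # [0, 1]
print(list(with_header.columns))  # ['Name', 'Age']
```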

index_col: None, Optional-

Column (or list of columns) to use as the index of the resulting DataFrame.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, match="Expenditure", index_col="Year")

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
      Expenditure  Profit
Year                     
2021          200     800
2022          300     900
'''
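The same idea can be expressed with a column position instead of a name. The sketch below assumes a made-up finance table and uses index_col=0 to promote its first column to the index.

```python
from io import StringIO

import pandas as pd

# Hypothetical table; "Year" is the first column.
html = """
<table>
  <tr><th>Year</th><th>Expenditure</th><th>Profit</th></tr>
  <tr><td>2021</td><td>200</td><td>800</td></tr>
  <tr><td>2022</td><td>300</td><td>900</td></tr>
</table>
"""

df = pd.read_html(StringIO(html), index_col=0)[0]

print(df.index.name)            # Year
print(df.loc[2021, "Profit"])   # 800
```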

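skiprows: (int, list-like or slice), Optional-

Number of rows to skip at the start of the table (default: None). A list of integers or a slice skips the rows at those 0-based positions instead.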

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, skiprows=1)

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
     Alice  25
0      Bob  30
1  Charlie  22
'''
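A common use of skiprows is to drop a title row that sits above the real header. The sketch below assumes a made-up table laid out that way, with every cell written as td, and combines skiprows=1 with header=0.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: a title row, then the real header, then the data.
html = """
<table>
  <tr><td>Quarterly report</td><td>draft</td></tr>
  <tr><td>Name</td><td>Age</td></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>30</td></tr>
</table>
"""

# Skip the title row, then treat the next row as the column names.
df = pd.read_html(StringIO(html), skiprows=1, header=0)[0]

print(list(df.columns))  # ['Name', 'Age']
print(df)
```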

attrs: dict, Optional-

Dictionary of HTML attributes (for example {'id': 'finance-table'}) used to select which table elements are read from the file, URL, or file-like object.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, attrs={'id': 'finance-table'})

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
   Year  Expenditure  Profit
0  2021          200     800
1  2022          300     900
'''
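For a self-contained view of attrs, the sketch below assumes a made-up page with two tables that differ only in their id attribute and selects one of them.

```python
from io import StringIO

import pandas as pd

# Hypothetical page; the ids are illustrative.
html = """
<table id="people-table">
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
</table>
<table id="finance-table">
  <tr><th>Year</th><th>Profit</th></tr>
  <tr><td>2021</td><td>800</td></tr>
</table>
"""

dfs = pd.read_html(StringIO(html), attrs={"id": "finance-table"})

print(len(dfs))  # 1
print(dfs[0])
```

Selecting by id or class tends to be more robust than relying on the position of a table in the returned list.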

parse_dates: (bool or list of columns), Optional-

Controls date parsing while the table is read. It follows the pd.read_csv() semantics, so besides True or False it also accepts a list of columns to parse as dates.

parse_dates = False (default) +

Pandas will not attempt to parse any columns as dates. If any date strings are present, they will be read as plain text.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'dates-class'}, parse_dates=False)

# Print the first matched DataFrame
print(dfs[0]['Date'])
'''
Output: 
0    2003-10-02
1    2003-04-16
2    2023-12-29
Name: Date, dtype: object
'''

parse_dates = ['Date'] +

Pandas parses the listed columns into datetime objects. Passing parse_dates=True instead follows the pd.read_csv() behaviour of trying to parse the index, not every column with date-like values.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'dates-class'}, parse_dates=['Date'])

# Print the first matched DataFrame
print(dfs[0]['Date'])
'''
Output: 
0   2003-10-02
1   2003-04-16
2   2023-12-29
Name: Date, dtype: datetime64[ns]
'''
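The difference between the two settings can be checked without a network connection. The sketch below assumes a made-up table with a Date column and compares the resulting dtypes; the exact datetime resolution reported may vary by pandas version.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with ISO-formatted date strings.
html = """
<table>
  <tr><th>Event</th><th>Date</th></tr>
  <tr><td>Launch</td><td>2003-10-02</td></tr>
  <tr><td>Review</td><td>2023-12-29</td></tr>
</table>
"""

as_text = pd.read_html(StringIO(html), parse_dates=False)[0]
as_dates = pd.read_html(StringIO(html), parse_dates=["Date"])[0]

# Only the second frame holds real datetime values.
print(pd.api.types.is_datetime64_any_dtype(as_text["Date"]))   # False
print(pd.api.types.is_datetime64_any_dtype(as_dates["Date"]))  # True
print(as_dates["Date"].dt.year.tolist())                       # [2003, 2023]
```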

thousands: single char, Optional-

Character acting as the thousands separator in numerical values.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'numbers-class'}, thousands='|')

# Print the first matched DataFrame
print(dfs[0]['Thousands'])
'''
Output: 
0    123908213
1        78234
2         1023
Name: Thousands, dtype: int64
'''

encoding: str, Optional-

Encoding used to decode the page (for example 'utf-8' or 'latin-1').

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, encoding='utf-8')

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''
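The effect of encoding is easiest to see with bytes that are not UTF-8. The sketch below assumes a made-up table encoded as Latin-1 and passed as a BytesIO object; the names in it are illustrative.

```python
from io import BytesIO

import pandas as pd

# Hypothetical table containing non-ASCII names.
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>José</td><td>25</td></tr>
  <tr><td>Zoë</td><td>30</td></tr>
</table>
"""

# Simulate a page served as Latin-1 bytes.
raw = html.encode("latin-1")

# Tell the parser how to decode those bytes.
df = pd.read_html(BytesIO(raw), encoding="latin-1")[0]
print(df["Name"].tolist())
```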

decimal: single char, Optional-

Decimal separator character.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, attrs={'class': 'numbers-class'}, decimal=',')

# Print the first matched DataFrame
print(dfs[0]['Decimal'])
'''
Output: 
0    12021
1       12
2      921
Name: Decimal, dtype: int64
'''
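The thousands and decimal parameters are often set together for European-style numbers. The sketch below assumes a made-up table that uses '.' for grouping and ',' for the decimal point.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with European-style number formatting.
html = """
<table>
  <tr><th>Item</th><th>Amount</th></tr>
  <tr><td>Rent</td><td>1.234,56</td></tr>
  <tr><td>Food</td><td>789,10</td></tr>
</table>
"""

df = pd.read_html(StringIO(html), thousands=".", decimal=",")[0]

# The Amount column is parsed as floats (1234.56 and 789.1).
print(df["Amount"].tolist())
print(df["Amount"].dtype)
```

Without these two arguments the defaults (',' for thousands and '.' for decimal) would be applied, which misreads values formatted this way.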

converters: dict, Optional-

Dictionary mapping column names or positions to functions. Each function receives the content of a single cell and returns the converted value, which lets you process certain columns while the table is read.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

def half_the_age(age):
    return int(age)/2

# Read HTML tables
dfs = pd.read_html(io=url, converters={'Age': half_the_age})

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
      Name   Age
0    Alice  12.5
1      Bob  15.0
2  Charlie  11.0
'''
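Another frequent use of converters is to keep numeric-looking text, such as codes with leading zeros, as strings. The sketch below assumes a made-up table with a Zip column.

```python
from io import StringIO

import pandas as pd

# Hypothetical table; the Zip values have meaningful leading zeros.
html = """
<table>
  <tr><th>City</th><th>Zip</th></tr>
  <tr><td>Boston</td><td>02134</td></tr>
  <tr><td>Newark</td><td>07102</td></tr>
</table>
"""

inferred = pd.read_html(StringIO(html))[0]
as_text = pd.read_html(StringIO(html), converters={"Zip": str})[0]

print(inferred["Zip"].tolist())  # [2134, 7102]
print(as_text["Zip"].tolist())   # ['02134', '07102']
```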

na_values: None, Optional-

It is used to specify additional values that should be treated as NaN (Not a Number) while reading the data.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, match="Position", na_values=["null"])

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
    Name   Age  Position
0  Alice   NaN   Manager
1    Bob  30.0  Engineer
'''

keep_default_na: (True or False), Optional-

It controls whether pandas' default set of missing-value strings (NA, null, NaN, and so on) is used when deciding which cells become NaN.

keep_default_na = True (default) +

Pandas recognizes the default missing-value strings (NaN, NA, null, etc.) and converts matching cells to NaN.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, match="Position", keep_default_na=True)

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
    Name   Age  Position
0  Alice   NaN   Manager
1    Bob  30.0  Engineer
'''

keep_default_na = False +

Pandas does not interpret the default missing values (e.g., NA, null, NaN, etc.) as missing values. It will only treat the values specified in the na_values parameter as NaN.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, match="Position", keep_default_na=False)

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
    Name   Age  Position
0  Alice  null   Manager
1    Bob    30  Engineer
'''

Values: +

If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.

If keep_default_na is False, and na_values are specified, only the values specified in na_values are used for parsing.

If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
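The four combinations above can be checked by counting missing cells. The sketch below assumes a made-up table in which "N/A" is one of the pandas default missing-value strings and "unknown" is a custom marker; count_missing is a hypothetical helper written for this illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: one real age, one default NA string, one custom marker.
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>N/A</td></tr>
  <tr><td>Cara</td><td>unknown</td></tr>
</table>
"""


def count_missing(**na_options):
    """Read the table with the given options and count NaN cells in Age."""
    df = pd.read_html(StringIO(html), **na_options)[0]
    return int(df["Age"].isna().sum())


print(count_missing())                                               # 1
print(count_missing(na_values=["unknown"]))                          # 2
print(count_missing(keep_default_na=False, na_values=["unknown"]))   # 1
print(count_missing(keep_default_na=False))                          # 0
```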

displayed_only: (True or False), Optional-

It controls whether elements styled with "display: none" are parsed. With the default, hidden tables and hidden elements inside tables are left out, so only what is actually shown on the page is processed into DataFrames.

displayed_only = True (default) +

Only visible tables (those not hidden using CSS) are read.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, displayed_only=True)

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
'''

displayed_only = False +

All tables, including those that are hidden via CSS, will be read.

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables.html"

# Read HTML tables
dfs = pd.read_html(io=url, displayed_only=False)

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
     Name  Age
0      We   25
1    Were   30
2  Hidden   22
'''
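The two settings can be compared on a page that contains one visible and one hidden table. The sketch below assumes such a page (the html string is made up) and counts the tables returned in each case.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: the second table is hidden with an inline style.
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
</table>
<table style="display: none">
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Hidden</td><td>22</td></tr>
</table>
"""

visible = pd.read_html(StringIO(html), displayed_only=True)
everything = pd.read_html(StringIO(html), displayed_only=False)

print(len(visible))     # 1
print(len(everything))  # 2
```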

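extract_links: (None, 'all', 'header', 'body' or 'footer'), Optional-

It controls which sections of the table have their links extracted. Cells in the chosen sections are returned as (text, href) tuples, with href set to None when a cell contains no link.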

import pandas as pd

# URL containing HTML tables
url = "https://betterdocs.tech/global/python/pandas/tables-links.html"

# Read HTML tables
dfs = pd.read_html(io=url, extract_links="all")

# Print the first matched DataFrame
print(dfs[0])
'''
Output: 
    (Name, None)                                (Link, None)
0  (Alice, None)  (Alice's Profile, https://dikshapadi.site)
1    (Bob, None)          (Bob's Profile, https://aarya.fun)
'''

extract_links=None, Does not extract any links.

extract_links='header', Extract links only from the table header.

extract_links='footer', Extract links only from the table footer.

extract_links='body', Extract links only from the body of the table.

extract_links='all', Extract links from all parts of the table.
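With extract_links='body' the header stays as plain strings while each body cell becomes a tuple. The sketch below assumes a made-up table and example.com URLs, requires a pandas version that supports extract_links, and shows one way to pull the URL out into its own column (the ProfileURL column name is hypothetical).

```python
from io import StringIO

import pandas as pd

# Hypothetical table with one link per row.
html = """
<table>
  <tr><th>Name</th><th>Profile</th></tr>
  <tr><td>Alice</td><td><a href="https://example.com/alice">Alice's Profile</a></td></tr>
  <tr><td>Bob</td><td><a href="https://example.com/bob">Bob's Profile</a></td></tr>
</table>
"""

df = pd.read_html(StringIO(html), extract_links="body")[0]

# Each body cell is a (text, href) tuple; href is None when there is no link.
print(df.loc[0, "Name"])     # ('Alice', None)
print(df.loc[0, "Profile"])  # ("Alice's Profile", 'https://example.com/alice')

# Split the tuples into a separate URL column.
df["ProfileURL"] = df["Profile"].map(lambda cell: cell[1])
print(df["ProfileURL"].tolist())
```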

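dtype_backend: ('numpy_nullable' or 'pyarrow'), Optional-

Back-end data type applied to the resulting DataFrames. 'numpy_nullable' returns DataFrames backed by nullable dtypes, and 'pyarrow' returns DataFrames backed by pyarrow ArrowDtype. The parameter is new in pandas 2.0.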
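A small comparison shows what the backend changes in practice. The sketch below assumes pandas 2.0 or later and a made-up table with one missing age; it uses the 'numpy_nullable' option so that no extra dependency such as pyarrow is needed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with a missing value in an integer column.
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>N/A</td></tr>
</table>
"""

default_df = pd.read_html(StringIO(html))[0]
nullable_df = pd.read_html(StringIO(html), dtype_backend="numpy_nullable")[0]

# By default the missing value forces the column to float64.
print(default_df["Age"].dtype)   # float64
# The nullable backend keeps integers and marks the gap as <NA>.
print(nullable_df["Age"].dtype)  # Int64
```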

storage_options: dict, Optional-

Dictionary of storage-specific options, such as credentials for cloud storage.
