Tuesday, June 10, 2025

How to use the urllib.request module in Python

The urllib.request module is a powerful and flexible part of Python's standard library for opening and reading URLs (Uniform Resource Locators). It provides functions and classes to fetch data from various types of URLs, including HTTP, HTTPS, FTP, and local files.

Here's a breakdown of how to use urllib.request with practical examples:

Basic GET Request (Fetching Web Page Content)

The simplest use case is to fetch the content of a web page using an HTTP GET request.

Python
import urllib.request
import urllib.error  # the exceptions urlopen raises live here

url = 'https://www.example.com'

try:
    # Open the URL
    with urllib.request.urlopen(url) as response:
        # Read the content
        html_content = response.read()

        # Decode the content (important for text)
        # You might need to check response.info().get_content_charset()
        # for the correct encoding, or try 'utf-8' or 'latin-1'
        decoded_content = html_content.decode('utf-8')

        print(decoded_content[:500]) # Print first 500 characters
        print(f"\nStatus Code: {response.status}")
        print(f"Headers: {response.info()}")

except urllib.error.URLError as e:
    print(f"Error accessing URL: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Explanation:

  1. import urllib.request: Imports the necessary module.
  2. urllib.request.urlopen(url): This is the core function. It opens the URL and returns a file-like object (http.client.HTTPResponse for HTTP URLs).
  3. with ... as response:: This is a context manager, ensuring that the connection is properly closed even if errors occur.
  4. response.read(): Reads the entire content of the response body as bytes.
  5. .decode('utf-8'): Converts the bytes content into a string using the specified encoding. Web pages are commonly UTF-8.
  6. response.status: Gets the HTTP status code (e.g., 200 for OK, 404 for Not Found).
  7. response.info(): Returns a http.client.HTTPMessage object containing the response headers.
  8. Error Handling (try...except): It's crucial to handle potential network errors (urllib.error.URLError) or other exceptions.
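
When the server declares a charset in its Content-Type header, decoding with that charset is safer than assuming UTF-8, as step 5 hints. A small sketch of that idea (the read_text helper name is my own, not part of urllib):

```python
import urllib.request

def read_text(response, fallback="utf-8"):
    # Prefer the charset the server declared in Content-Type;
    # fall back to a default when none is given.
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset)

# Usage (requires network access):
# with urllib.request.urlopen("https://www.example.com") as resp:
#     print(read_text(resp)[:200])
```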

Handling HTTP POST Requests (Sending Data)

To send data to a server (e.g., submitting a form), you'll typically use an HTTP POST request. You need to encode the data and pass it as the data argument to urlopen().

Python
import urllib.request
import urllib.parse # For encoding data
import urllib.error

url = 'https://httpbin.org/post' # A public service for testing HTTP requests

# Data to send (usually a dictionary)
post_data = {
    'name': 'Alice',
    'age': '30',
    'city': 'New York'
}

# Encode the data for URL submission (x-www-form-urlencoded format)
encoded_data = urllib.parse.urlencode(post_data).encode('utf-8')

try:
    # Create a Request object
    req = urllib.request.Request(url, data=encoded_data, method='POST')
    # You can also add headers here if needed, e.g., for Content-Type
    req.add_header('Content-Type', 'application/x-www-form-urlencoded')
    # req.add_header('User-Agent', 'Mozilla/5.0') # Good practice to set User-Agent

    with urllib.request.urlopen(req) as response:
        print(f"Status Code: {response.status}")
        print(response.read().decode('utf-8'))

except urllib.error.URLError as e:
    print(f"Error accessing URL: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Explanation:

  1. urllib.parse.urlencode(post_data): Converts a dictionary of key-value pairs into a URL-encoded string (e.g., name=Alice&age=30).
  2. .encode('utf-8'): The data argument to urlopen must be bytes, so encode the string.
  3. urllib.request.Request(url, data=encoded_data, method='POST'): Instead of directly using urlopen(url), we create a Request object. This gives us more control, allowing us to specify the HTTP method, add headers, etc.
  4. req.add_header(...): Adds custom HTTP headers to the request. Content-Type is important for POST requests so the server knows how the body is formatted. Setting a User-Agent is good practice because some servers block requests from clients they don't recognize.
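
For a JSON API the same Request mechanism applies; only the body encoding and the Content-Type header change. A sketch, again using the httpbin.org test service (the actual network call is left commented out):

```python
import json
import urllib.request

url = "https://httpbin.org/post"  # public echo service, for illustration

# Serialize the payload as JSON bytes instead of form encoding
payload = json.dumps({"name": "Alice", "age": 30}).encode("utf-8")

req = urllib.request.Request(url, data=payload, method="POST")
req.add_header("Content-Type", "application/json")

# Uncomment to actually send the request (needs network access):
# with urllib.request.urlopen(req) as response:
#     print(response.read().decode("utf-8"))
```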

Adding Custom Headers (User-Agent, Authorization, etc.)

You can add custom headers to your requests using the Request object.

Python
import urllib.request
import urllib.error

url = 'https://www.example.com'

# Create a Request object
req = urllib.request.Request(url)

# Add headers
req.add_header('User-Agent', 'MyCustomPythonApp/1.0')
req.add_header('Accept-Language', 'en-US,en;q=0.5')
# For authorization
# req.add_header('Authorization', 'Bearer YOUR_TOKEN_HERE')

try:
    with urllib.request.urlopen(req) as response:
        print(f"Status Code: {response.status}")
        print(response.info()) # Print all response headers
        print(response.read().decode('utf-8')[:200])

except urllib.error.URLError as e:
    print(f"Error accessing URL: {e.reason}")
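
One quirk worth knowing: Request.add_header() normalizes header names with str.capitalize(), so a header added as "User-Agent" is stored and looked up as "User-agent". A quick sketch of inspecting a request's headers before sending it:

```python
import urllib.request

req = urllib.request.Request("https://www.example.com")
req.add_header("User-Agent", "MyCustomPythonApp/1.0")
req.add_header("Accept-Language", "en-US,en;q=0.5")

# Names come back capitalized as "User-agent", "Accept-language"
print(req.header_items())
print(req.get_header("User-agent"))
```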

Handling Redirects, Cookies, and Authentication (OpenerDirector)

For more complex scenarios, urllib.request provides OpenerDirector and handlers. This allows you to manage things like cookies, proxies, HTTP authentication, and more robust redirect handling.

Python
import urllib.request
import urllib.error
import http.cookiejar # For cookie management

# Create a CookieJar to store cookies
cookie_jar = http.cookiejar.CookieJar()

# Create a HTTPCookieProcessor to handle cookies
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# Create a custom opener (build_opener adds the default HTTP/HTTPS
# handlers automatically, so only the cookie handler needs listing)
opener = urllib.request.build_opener(cookie_handler)

# Now, use the opener to open URLs
url = 'https://www.example.com' # Or a site that uses cookies

try:
    with opener.open(url) as response:
        print(f"Status Code: {response.status}")
        print("Cookies after first request:")
        for cookie in cookie_jar:
            print(f"  Name: {cookie.name}, Value: {cookie.value}")

    # Make another request to the same domain to see if cookies are sent
    # This might require a site that actually sets/uses cookies for a visible effect
    # with opener.open(url) as response_again:
    #     print("\nStatus Code for second request:", response_again.status)
    #     print("Cookies sent with second request (if any were set):", cookie_jar)

except urllib.error.URLError as e:
    print(f"Error accessing URL: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Explanation:

  1. http.cookiejar.CookieJar(): An object that stores HTTP cookies.
  2. urllib.request.HTTPCookieProcessor(cookie_jar): A handler that knows how to process cookies (sending and receiving them).
  3. urllib.request.build_opener(...): This function creates an OpenerDirector object. You pass it various "handlers" for different protocols or features (like cookie handling, HTTP/HTTPS handling, proxy handling, etc.).
  4. opener.open(url): You use this custom opener object instead of the global urllib.request.urlopen(). All requests made with this opener will use the handlers configured for it.
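
If you want plain urlopen() calls to go through your custom opener as well, you can install it globally with install_opener(). A minimal sketch:

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))

# Make this opener the default, so module-level urlopen() uses it too
urllib.request.install_opener(opener)

# From here on, urllib.request.urlopen(...) sends and stores cookies
# through cookie_jar, exactly like opener.open(...).
```

This is convenient in small scripts, but in larger code passing the opener around explicitly keeps the cookie handling visible.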

Handling Exceptions

It's vital to handle exceptions to make your network requests robust.

  • urllib.error.URLError: Raised for problems with the URL (e.g., unknown host, no network connection). It has a reason attribute.
  • urllib.error.HTTPError: A subclass of URLError, specifically for HTTP errors (e.g., 404 Not Found, 403 Forbidden, 500 Internal Server Error). It has code (HTTP status code) and headers attributes.
Python
import urllib.request
import urllib.error

invalid_url = 'http://this-is-not-a-real-domain-123456789.com'
forbidden_url = 'https://httpbin.org/status/403' # Example URL that returns 403

# Example 1: URLError (DNS lookup failure)
try:
    with urllib.request.urlopen(invalid_url) as response:
        pass
except urllib.error.URLError as e:
    print(f"URLError for {invalid_url}: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

print("-" * 30)

# Example 2: HTTPError (Forbidden)
try:
    with urllib.request.urlopen(forbidden_url) as response:
        pass
except urllib.error.HTTPError as e:
    print(f"HTTPError for {forbidden_url}: Status {e.code}, Reason: {e.reason}")
    # You can also access headers: print(e.headers)
except urllib.error.URLError as e: # Catch URLError as well for broader issues
    print(f"URLError (non-HTTP specific) for {forbidden_url}: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
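
Because HTTPError is itself a readable response object, you can read the server's error page from the exception, which is often useful with JSON APIs that return details in the error body. A sketch (describe_http_error is a hypothetical helper name):

```python
import urllib.request
import urllib.error

def describe_http_error(url):
    """Fetch url; return (status_code, body) for both success and HTTP errors."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as e:
        # The exception doubles as the response: read the error body from it
        return e.code, e.read()

# Usage (needs network access):
# code, body = describe_http_error("https://httpbin.org/status/404")
# print(code, body[:100])
```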

Other Useful Features:

  • Proxies: Use urllib.request.ProxyHandler with build_opener.
  • Basic Authentication: Use urllib.request.HTTPBasicAuthHandler with build_opener.
  • Timeouts: You can pass a timeout argument to urlopen() to set a maximum time for the request to complete.
    Python
    try:
        with urllib.request.urlopen(url, timeout=5) as response: # 5 second timeout
            print(response.status)
    except urllib.error.URLError as e:
        print(f"Timeout or other URL error: {e.reason}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
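
The proxy and authentication handlers mentioned above combine with build_opener the same way the cookie handler did. A sketch with a hypothetical proxy address and credentials (substitute your own):

```python
import urllib.request

# Route requests through a proxy (address here is a placeholder)
proxy_handler = urllib.request.ProxyHandler(
    {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"})

# Supply HTTP Basic Auth credentials for a given URL prefix
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "https://api.example.com",
                          "myuser", "mypassword")
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

opener = urllib.request.build_opener(proxy_handler, auth_handler)

# opener.open("https://api.example.com/data")  # would use proxy + auth
```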
    

urllib.request is a fundamental module, but for more advanced HTTP features (like easier JSON handling, session management, file uploads, etc.), the third-party requests library is often preferred due to its simpler API and richer functionality. However, urllib.request is built-in and sufficient for many common tasks.
