Pri  Item
-
Fix bugs
4
Handling of http://:@server/ URLs
The password needs to contain *** to trigger this. The problem happens when printing reports, so there I should probably just remove everything before the @. That has its problems as well, but seems the most logical thing to do. See the report by Petter Reinholdtson.
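A minimal sketch of the remove-everything-before-the-@ approach (the variable names are made up, not Checkbot's own):

    # Strip the userinfo part (user:password@) from a URL before it is
    # printed in a report, so credentials never end up in the output.
    my $url = 'http://user:secret@www.example.com/path/';
    (my $safe_url = $url) =~ s{^(\w+://)[^/@]*@}{$1};
    # $safe_url is now 'http://www.example.com/path/'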
5
Fix problem with correctly parsing base URL in page.
The report by Dan Heller provides a lot of information and a reproducible case for this.

However, when trying it again on 30/4 I couldn't reproduce it. My guess is that an upgrade to HTML::LinkExtor has fixed things, even though I can't tell from the Changes file. I'm following up with Dan to confirm.

So far no luck tracking this down: it still works for me, and we have not determined why it fails for Dan. I've mailed Dan my development version of Checkbot just in case.
-
Improve current features
1
Infinite loop
One idea is to work with a maximum URL length. URLs longer than the limit (settable with an option) would no longer be checked, thus breaking the infinite loop.
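A tiny sketch of that check; the limit value and names are made up:

    # Skip URLs longer than a (hypothetical) --max-url-length limit, so a
    # link that keeps growing on every recursion step is eventually dropped.
    my $max_url_length = 250;
    sub url_too_long {
        my ($url) = @_;
        return length($url) > $max_url_length;
    }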
-
Options
3
--file argument has issues
Specifically, it doesn't deal very well with paths given to the --file option. Perhaps it would be better to have a --directory option instead, and put the Checkbot files in that directory (with an index.html file instead of checkbot.html).
2
Better default for --match argument
Only include the directory path, and leave out any HTML or other file name (i.e., include everything up to and including the last / in the default match).
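A sketch of how that default could be derived from the starting URL (illustrative names, not Checkbot's own code):

    # Keep everything up to and including the last slash of the start URL,
    # dropping the file name, and use that as the default --match value.
    my $start_url = 'http://www.example.com/project/index.html';
    (my $default_match = $start_url) =~ s{[^/]*$}{};
    # $default_match is now 'http://www.example.com/project/'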
4
List all options in report
List all options on the generated page, not just the URLs and the match option?
5
The --match and --ignore options need to be rethought
Perhaps have separate options for internal and external links? It would also be useful to have an option to ignore specific URLs altogether. Furthermore, perhaps the names could be more clear: e.g. match -> local, exclude -> external, ignore -> ignore-problem-urls, etc.
-
Report
1
Only create report pages for servers with bugs
2
Show broken URLs according to the URL hierarchy
4
Create options for reporting order
Sort problems on the server page in a different order (e.g. critical errors first).

Sort the problem reports by page instead of by type of problem for easier fixing.
3
Add link text to report
Sometimes it is difficult (for the Web editors) to locate the source of problem links, especially when the URLs are extracted from a database (as we often do). Maybe it would be useful if Checkbot could (optionally) list the link text (the text between <a> and </a>) in the Checkbot output.
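One possible way to collect the link text is with HTML::Parser handlers; this is only a sketch with made-up variable names, and Checkbot would still have to carry the text into the report:

    use HTML::Parser;

    my $html_page = '<a href="http://www.example.com/">example link</a>';
    my %link_text;        # href => text found between <a ...> and </a>
    my $current_href;

    my $p = HTML::Parser->new(
        start_h => [ sub {
            my ($tag, $attr) = @_;
            $current_href = $attr->{href}
                if $tag eq 'a' and defined $attr->{href};
        }, 'tagname, attr' ],
        text_h  => [ sub {
            $link_text{$current_href} .= $_[0] if defined $current_href;
        }, 'dtext' ],
        end_h   => [ sub {
            undef $current_href if $_[0] eq 'a';
        }, 'tagname' ],
    );
    $p->parse($html_page);
    $p->eof;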
-
Internals
-
Report all the pages on which a broken link appears.
-
Add new features
-
Distribution
-
Get Checkbot listed as a module, so that it can be installed with CPAN?
4
Include CGI script
Paul Williams has submitted a script for this. It can be found in the Checkbot project directory.
-
Options
5
Run command on each document
This could be used to do something with each document retrieved, e.g. run weblint or tidy on it. Obviously the pages themselves cannot be changed, but it would allow for flagging errors. What to do with the output, though...
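A rough sketch of running such a command, assuming a hypothetical --command option and simply warning about any output for now:

    use File::Temp qw(tempfile);

    # Write the retrieved document to a temporary file and run an external
    # command (e.g. "tidy -errors -quiet" or "weblint") on it.
    sub run_external_check {
        my ($command, $content, $url) = @_;
        my ($fh, $tmpfile) = tempfile(UNLINK => 1);
        print $fh $content;
        close $fh or warn "close failed: $!";
        my $output = `$command $tmpfile`;
        warn "$url:\n$output" if defined $output and length $output;
    }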
1
Add authentication support
checkbot-ulrike has a simple solution. Perhaps include that for now?
1
Proxy authentication
See patch from Joerg Schneider. Also requires a custom UA.pm class. Support should really be in LWP, perhaps it has been added in the meantime?
-
Use list of links as starting URL
This would just be a plain list of URLs generated by some other source. This could also be given on the command line, making Checkbot part of the Unix tool set.
1
Substitute argument
Allow a regexp substitution to be run on each URL. Useful to filter out things like PHPSESSIDs which get added automatically. See request from Tabor Wells.
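A sketch of such a substitution, hard-coded here for the PHPSESSID case rather than taken from an option:

    # Drop a PHPSESSID query parameter before the URL is queued, so the
    # same page is not checked once per session id.
    my $url = 'http://www.example.com/page.php?PHPSESSID=abc123&x=1';
    $url =~ s/PHPSESSID=[^&;]*[&;]?//;
    # $url is now 'http://www.example.com/page.php?x=1'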
1
Suppression file
Use a file to indicate which errors should be suppressed. Check the difference with the ignore function. See patch.
1
--nohead option
Uses GET exclusively. Not sure how useful. The author created it to pre-load a proxy cache.
-
Innards
-
Handle links with just a # in them.
> One suggestion: In my search, there was another linkchecker that had
> one feature I found VERY useful.  It had the ability to report links
> that were simply '#'.  It had found about 10 "null hash" links on a
> site that were not supposed to be there.  We have many such links
> because of Javascript usage.  However, often times '#' is used to
> create dummy links, of which the 10 found were exactly this. If
> Checkbot had an option to check for the "null hash", this would be a
> wonderful thing.
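A sketch of how these null-hash links could be counted with HTML::LinkExtor; note that no base URL is passed to the constructor, so the raw '#' value is still visible in the callback:

    use HTML::LinkExtor;

    my $null_hash_count = 0;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        $null_hash_count++
            if $tag eq 'a' and defined $attr{href} and $attr{href} eq '#';
    });
    $extor->parse('<a href="#">dummy</a> <a href="page.html">real</a>');
    $extor->eof;
    print "$null_hash_count null hash link(s) found\n";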
1
Add support for RobotUA
There is a patch for this from Nick Hibma. I think people don't always want this. Maybe the default should be to run with RobotUA, and there should also be an option to run without it.

Furthermore, there is the question of how to present the results; again, this depends on the use case.
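A sketch of the RobotUA variant (the agent name and addresses are placeholders):

    use LWP::RobotUA;
    use HTTP::Request;

    # LWP::RobotUA honours robots.txt and waits between requests to the
    # same host; an option could fall back to a plain LWP::UserAgent.
    my $ua = LWP::RobotUA->new('Checkbot/x.x', 'webmaster@example.com');
    $ua->delay(1/60);       # delay is in minutes: at most one request per second
    my $response = $ua->request(HTTP::Request->new(HEAD => 'http://www.example.com/'));
    print $response->code, ' ', $response->message, "\n";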
3
Use new features of HTML::Parser
For example, HTML::Parser 3.24 introduced (or fixed) counters for offset/line/column. This could be useful in the reports. Perhaps adding other information would be useful as well.
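A small sketch, not Checkbot's actual parsing code, showing how the line and column pseudo-arguments can be requested in the handler argspec:

    use HTML::Parser;

    my $p = HTML::Parser->new(
        start_h => [ sub {
            my ($tag, $attr, $line, $column) = @_;
            print "link to $attr->{href} at line $line, column $column\n"
                if $tag eq 'a' and defined $attr->{href};
        }, 'tagname, attr, line, column' ],
    );
    $p->parse("<p>text</p>\n<a href=\"page.html\">a link</a>\n");
    $p->eof;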
1
Add a cache for 500 errors
Most often these are temporary timeouts, so keeping old results around makes sense.

In some cases the errors are simply the result of temporary unavailability. I've been handling this by extracting the 500-series links by hand and re-running them at intervals. It would be nice if 'checkbot' had some sort of internal bookkeeping, and would 'keep trying' for a specified duration (but not just a long time-out, since that could prevent the remaining links from being checked).
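A small sketch of the bookkeeping this would need; the data structures and the re-check queue are made up:

    # Remember URLs that returned a 5xx status and re-queue them a limited
    # number of times during the run instead of reporting them right away.
    my %server_error_tries;     # url => number of failed attempts so far
    my @recheck_queue;

    sub handle_server_error {
        my ($url, $response) = @_;
        if ($response->code >= 500 and $server_error_tries{$url}++ < 3) {
            push @recheck_queue, $url;      # try again later in this run
            return;
        }
        # after three tries the error is considered real and gets reported
    }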
3
Support authentication
Add a mechanism for entering authentication data in order to check such areas as well. Possibly allow multiple pairs. I'd rather not enter the passwords on the command line.
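One possible shape for this, keeping passwords off the command line by reading them from the environment; the subclass name and environment variables are made up:

    package Checkbot::AuthUA;
    use base 'LWP::UserAgent';

    # Called by LWP whenever a request gets a 401; return the user/password
    # pair for the area. A fuller version would look up (host, realm) pairs
    # supplied through an option or a file.
    sub get_basic_credentials {
        my ($self, $realm, $uri) = @_;
        return ($ENV{CHECKBOT_USER}, $ENV{CHECKBOT_PASSWORD});
    }

    package main;
    my $ua = Checkbot::AuthUA->new;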
2
Substitute HEAD with GET
4
Keep state between runs
Keep state between runs, but make sure we still are able to run Checkbot on several areas (concurrently).

Uses for state information: a list of consistently bad hosts, remembering previous bad links and re-checking just those with a `quick' option, and reporting on hosts which keep timing out.
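A minimal sketch using Storable (a made-up file name; a real version would key it on the area being checked so concurrent runs do not share state):

    use Storable qw(store retrieve);

    my $state_file    = 'checkbot.state';   # hypothetical per-area file name
    my $old_bad_links = -e $state_file ? retrieve($state_file) : {};

    # during the run, collect this run's bad links, e.g.:
    # $bad_links{$url} = $response->code if $response->is_error;
    my %bad_links;

    store(\%bad_links, $state_file);        # write the state back at the end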
5
Parse client-side (and server-side) MAPs if possible.
3
Keep an internal list of hosts to which we cannot connect, so that we avoid being stalled for a while on each link to such a host.
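A sketch of the bookkeeping, with made-up names:

    my %dead_host;      # host => 1 once a connection to it has failed

    sub skip_because_host_is_dead {
        my ($uri) = @_;                     # a URI object
        return $dead_host{ $uri->host };
    }

    # ... and after a connect failure: $dead_host{ $uri->host } = 1;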
3
Implement hop count
Add an option to count hops instead of using match, and only hop that many links away?

Suggested for single page checking, but might be useful on a larger scale as well? Yes, for instance against servers that create recursive symlinks by accident.

This option could be further specialized to apply to specific match rules.
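A sketch of the queue bookkeeping a hop count would need; extract_links() is a made-up stand-in for Checkbot's own link extraction:

    my $max_hops  = 2;                          # hypothetical --hop-count value
    my $start_url = 'http://www.example.com/';
    my @queue     = ( [ $start_url, 0 ] );      # [ url, hops from the start ]
    my %seen;

    sub extract_links { return () }             # stub; Checkbot has its own

    while (my $entry = shift @queue) {
        my ($url, $hops) = @$entry;
        next if $seen{$url}++ or $hops > $max_hops;
        for my $link (extract_links($url)) {
            push @queue, [ $link, $hops + 1 ];
        }
    }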
3
Fix timeout
http://people.we.mediaone.net/kfrankel/lwpfaq.txt

Setting a timeout on the user agent doesn't always make URLs time out. Leon Abelman points to the FAQ and suggests either an extra option, or dealing with the timeout ourselves.
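A sketch of the deal-with-the-timeout-ourselves variant using alarm(), since the user agent timeout only limits individual network reads (Unix-only, made-up names):

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua  = LWP::UserAgent->new;
    my $url = 'http://www.example.com/';

    my $response = eval {
        local $SIG{ALRM} = sub { die "checkbot timeout\n" };
        alarm(30);                          # overall limit for this request
        my $r = $ua->request(HTTP::Request->new(HEAD => $url));
        alarm(0);
        $r;
    };
    alarm(0);                               # make sure the alarm is cleared
    warn "$url timed out\n"
        if !defined($response) && $@ eq "checkbot timeout\n";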
4
Use parallel checking
5
Also check local links
I tried your checkbot, and like to thank you very much for this nice tool - despite the fact that I needed something which checks local references ( / ) as well :-( -- Michael Hoennig
5
Add support for HTTP 1.1
This will be in libwww-perl 5.54. The code snippet to implement this is already in Checkbot, but commented out. Need to wait for 5.54, and then test it.
-
Report
-
Create individual pages for each error
I have checkbot walking a fairly large site nightly and thus the output page is quite large. It'd be really nice if it could be broken down into multiple pages based on error code.
1
Add textual output
Could be created such that it grows at the end, so that it can be followed with tail?
-
Plain text output
Or perhaps something like CSV output which can then be used in other applications.
-
Open links in new window
Maybe use a Netscape feature to open problem links in a new browser window, so that the problem links page remains visible and available. Frames? (*shudder*)
-
Include help text for error messages
Include (or link to) a page which contains explanations for the different error messages. (But watch out for server-specific messages, if any)
-
Summarize documents
Perhaps use HTML::Summary to provide summaries of the documents with errors?
-
Robot
-
Add Cookie support
-
Obey Robot rules. My current idea is to obey robot rules by default on all 'external' requests, and ignore them on 'local' requests. Both of these should be changeable.
Obeying internal robots.txt might be a nice way to have a good exclude mechanism.
-
Use Robot rules
I have a patch for this in my mailbox from Nick Hibma.
-
Decide whether to add --match-url-base patch
This patch by David Brownlee makes the default --match to be the site name only. It's in my mail.
-
Track vague problems
-
Fix recursion run-away problem
I haven't seen any reproducible cases yet, so I don't know what causes this.

Two possible solutions are:

1. Scan for '///' patterns in URLs and deal with them (would have to read the specs on URIs first).

2. Use MD5 checksums on pages to trap this in some way.
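A sketch of the second idea using Digest::MD5; seen_before() is a made-up helper name:

    use Digest::MD5 qw(md5_hex);

    my %seen_digest;    # checksum of page content => number of times seen

    # Returns true if a page with exactly this content was already parsed,
    # which catches the same document reappearing under ever-longer URLs.
    sub seen_before {
        my ($content) = @_;
        return $seen_digest{ md5_hex($content) }++;
    }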