URL
Several HTML elements, most notably the A element, may contain an attribute which takes a URL as value. URLs, Uniform Resource Locators, are addresses of Web documents. More generally, URLs can be used on the Web to refer to "objects" on the Web or in other information systems.
The general syntax of absolute URLs is the following:
scheme://host:port/path/filename
where
- scheme
- specifies the information system (technically speaking, the protocol) to be used to access the resource; possible values include the following:
  - http
  - a Web document (to be accessed using Hypertext Transfer Protocol, HTTP)
  - ftp
  - a resource to be retrieved using FTP (File Transfer Protocol), usually a file in a so-called FTP server
  - file
  - a file on a particular computer; a file URL is hardly useful on the Web
  - gopher
  - a file in a Gopher server
  - mailto
  - electronic mail address
  - news
  - a newsgroup or an article in Usenet news
  - telnet
  - for starting an interactive session via the Telnet protocol (which is part of TCP/IP)
- host
- is the Internet host name in the domain notation, e.g. www.hut.fi (or sometimes a numerical TCP/IP address); notice that typically, but not necessarily, Web servers have domain names starting with www
- port
- is the port number part, which can usually be omitted since it has a reasonable default; that is, omit it unless it is part of a URL which you got somewhere (or you really know what you are doing)
- path
- is a directory path within the host
- filename
- is a file name within the directory.
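The breakdown above can be tried out with Python's standard urllib.parse module. This is only a minimal sketch: the URL, including its port number and path, is a made-up example (only the host name www.hut.fi comes from the text), and splitting the path at its last slash is an assumption made here to mirror the path/filename distinction.

```python
from urllib.parse import urlsplit

# Hypothetical URL for illustration; port and path are invented.
url = "http://www.hut.fi:8080/docs/guide/index.html"
parts = urlsplit(url)

scheme = parts.scheme    # protocol used to access the resource
host = parts.hostname    # Internet host name
port = parts.port        # port number as an integer, or None if omitted

# urlsplit does not separate directory path from file name,
# so split at the last slash ourselves.
path, _, filename = parts.path.rpartition("/")

print(scheme, host, port, path, filename)
```

If the port is left out of the URL, parts.port is None and the client falls back on the default port of the scheme.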
Warning: Although many browsers allow you to omit the http:// part when specifying the URL of a document to be visited, you must not omit it when writing a normal URL into an HTML document. (Otherwise browsers will try to interpret it as a relative URL.)
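The effect described in the warning can be reproduced with urljoin from Python's standard library, which resolves a reference against a base URL in roughly the way a browser resolves a link against the address of the current document. The base URL below is a hypothetical example.

```python
from urllib.parse import urljoin

# Hypothetical address of the HTML document that contains the link.
base = "http://www.example.org/dir/page.html"

# Without the scheme, the reference is treated as a relative URL
# and resolved inside the directory of the current document.
without_scheme = urljoin(base, "www.hut.fi/index.html")

# With the scheme, the reference is absolute and used as such.
with_scheme = urljoin(base, "http://www.hut.fi/index.html")

print(without_scheme)
print(with_scheme)
```

The first result points to a directory named www.hut.fi on the original server, which is almost never what the author intended.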
Actually, this pattern is mainly for Web documents, i.e. http URLs. For other URLs, simplifications and special interpretations are applied. For example, a mailto URL is just of the form mailto:address where address is a normal Internet E-mail address like Jukka.Korpela@hut.fi (as specified in RFC 822).
Please notice that appending anything to the E-mail address in a mailto URL is nonstandard and may result in lost mail without anyone noticing! (See also the discussion of mailto: URLs in the description of the A element.)
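As a small illustration of how a mailto URL departs from the general pattern, the sketch below splits one with Python's urllib.parse. It assumes nothing beyond the address already quoted in the text; there is no host or port part, and the whole address ends up in the path component.

```python
from urllib.parse import urlsplit

parts = urlsplit("mailto:Jukka.Korpela@hut.fi")

scheme = parts.scheme    # the scheme name before the colon
netloc = parts.netloc    # empty: a mailto URL has no //host:port part
address = parts.path     # the E-mail address itself

print(scheme, repr(netloc), address)
```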
An http URL can also be followed by a fragment identifier: the complete form consists of an absolute URL, the # sign and a name (which refers to a location within the document specified by the absolute URL). See the description of the A element for more information.
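The split between the absolute URL and the name after the # sign can be sketched with urldefrag from Python's standard library. The document path and the fragment name below are hypothetical.

```python
from urllib.parse import urldefrag

# Hypothetical URL with a fragment identifier.
result = urldefrag("http://www.hut.fi/docs/guide.html#intro")

document_url = result.url     # the absolute URL of the document
fragment = result.fragment    # the name of the location within it

print(document_url, fragment)
```

Only the part before the # sign identifies the document to be fetched; the name after it is interpreted by the browser once the document has arrived.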
It is safest to enclose URLs in quotes when writing them as attribute values in HTML.
For an overview of URLs, see W3C material on addressing.
As regards the technical specifications of the syntax of URLs, see RFC 1738 (absolute URLs) and RFC 1808 (relative URLs).
In particular, the specifications say that within a URL only a limited set of characters can be used as such:
- alphanumeric characters (A to Z, a to z, 0 to 9)
- the characters $-_.+!*'(),
- the characters ;/?:@=&# provided that they are used in the special meaning reserved for them in the RFCs mentioned above.
Other characters must be encoded. (The characters ;/?:@=&# must also be encoded if they are not used in the special meaning.) This encoding (which is defined by URL specifications, not HTML specifications) consists of the percent sign followed by two hexadecimal digits that represent the code position of the character. For example, tilde (~) should be presented as %7E and space as %20. (Violating the rules is much more likely to cause problems in the latter case than in the former.)
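The encoding rule can be sketched as a short Python function. This is an illustration under stated assumptions, not a reference implementation: the function name encode_url_part and its keep_reserved parameter are invented here, and characters outside ASCII are encoded byte by byte in UTF-8, which is a modern convention that the rules quoted above do not specify. A hand-written function is used because the standard urllib.parse.quote follows later rules and leaves the tilde unencoded.

```python
import string

# Characters that may appear as such, per the lists above.
SAFE = set(string.ascii_letters + string.digits + "$-_.+!*'(),")
# Characters allowed only in their reserved, special meaning.
RESERVED = set(";/?:@=&#")

def encode_url_part(text, keep_reserved=False):
    """Percent-encode every character that may not appear as such.

    keep_reserved=True leaves ;/?:@=&# alone, for text in which they
    are used in their special meaning (e.g. / as a path separator).
    """
    allowed = SAFE | RESERVED if keep_reserved else SAFE
    out = []
    for ch in text:
        if ch in allowed:
            out.append(ch)
        else:
            # Assumption: non-ASCII characters are encoded as UTF-8 bytes.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

print(encode_url_part("~jkorpela/my file.html", keep_reserved=True))
print(encode_url_part("a&b"))
```

The first call keeps the slash because it acts as a path separator; the second encodes the ampersand because it is ordinary data there.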
When a URL occurs as an attribute value in HTML, there is another complication caused by the & character, which may have special use in query form submissions. In principle, that character should be escaped as &amp; or as &#38; (there is a footnote in the HTML 2.0 specification about this), and browsers should process it so that the actual URL passed to the processing CGI script has that notation replaced by a plain & character. (Notice that the & must not be URL-encoded in the way described above, since it is being used in its special meaning. This is a confusing issue, and CGI scripts should really be written so that semicolon ; and not ampersand & is used as field separator.)
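The two layers involved can be sketched with Python's standard html module. The URL and its query fields are hypothetical; the point is only that the & is escaped when the URL is written into the HTML attribute, and that the browser undoes the escaping before it uses the URL.

```python
import html

# Hypothetical URL of a CGI script with two query fields.
url = "http://www.example.org/cgi-bin/search?name=x&lang=en"

# What the author writes as the attribute value in the HTML source.
attribute_value = html.escape(url, quote=True)
link = '<A HREF="{}">search</A>'.format(attribute_value)

# What the browser recovers from the attribute and actually requests.
recovered = html.unescape(attribute_value)

print(link)
print(recovered)
```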