Understanding the columns/fields in Nutch 2.0 Webpage
Aug 12, 2012
Before jumping in, it helps to review at a very high level how Nutch crawls the web and stores the results, because the steps of the crawl are linked to the columns in the webpage table. First, an initial set of seed urls is injected; then the crawl cycle is repeated. Each crawl cycle consists of the generate, fetch, parse and database update job steps. These steps use the various columns in the webpage table in the Nutch database.
The following list contains all of the columns in the webpage table in the order they appear in the table. When looking at the webpage table in the Nutch database, remember that every row represents an individual url. Where applicable, it is noted which step of the crawl cycle primarily uses the column.
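The ordering of the steps can be sketched as a plain list of step names. This is only an illustration of the cycle described above; the function name crawl_plan and the step labels are made up for this example and are not Nutch commands.

```python
def crawl_plan(rounds: int) -> list[str]:
    """Return the job steps of a crawl in order (illustrative labels only)."""
    steps = ["inject"]  # seed urls are injected once, up front
    for _ in range(rounds):
        # every crawl cycle runs the same four job steps
        steps += ["generate", "fetch", "parse", "dbupdate"]
    return steps


if __name__ == "__main__":
    for step in crawl_plan(rounds=2):
        print(step)
```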
id - Generator field. This is used as the index of the table and consists of the url in a slightly different order (reversed domain name:protocol:port and path) from the order normally seen in your web browser, so that it can be searched more quickly. Nutch provides convenience utility methods in TableUtil, for example for unreversing urls. Note that using the url as the primary key means the default Nutch 2.0 design keeps track of the current state of the crawl universe. Nutch 2.0 is not designed for keeping an archive of pages as they change over time (at least not without a little modification).
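To make the key layout concrete, here is a minimal Python sketch of the reversed form described above (reversed domain name:protocol:port and path). It is an approximation written for this post, not the TableUtil code itself; the function names reverse_url and unreverse_url are chosen for the example, and corner cases such as IPv6 hosts are ignored.

```python
from urllib.parse import urlsplit


def reverse_url(url: str) -> str:
    """Build a row key of the form reversed.host:protocol[:port]/path?query."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    key = ".".join(reversed(host.split("."))) + ":" + parts.scheme
    if parts.port is not None:  # the port is only part of the key when explicit
        key += f":{parts.port}"
    rest = parts.path
    if parts.query:
        rest += "?" + parts.query
    if rest and not rest.startswith("/"):
        rest = "/" + rest
    return key + rest


def unreverse_url(key: str) -> str:
    """Turn a reversed row key back into a normal url."""
    slash = key.find("/")
    head, rest = (key, "") if slash == -1 else (key[:slash], key[slash:])
    fields = head.split(":")
    host = ".".join(reversed(fields[0].split(".")))
    port = f":{fields[2]}" if len(fields) > 2 else ""
    return f"{fields[1]}://{host}{port}{rest}"


if __name__ == "__main__":
    key = reverse_url("http://bar.foo.com:8983/to/index.html?a=b")
    print(key)                 # com.foo.bar:http:8983/to/index.html?a=b
    print(unreverse_url(key))  # http://bar.foo.com:8983/to/index.html?a=b
```

Because the most significant part of the host name comes first, all pages of one domain sort next to each other in the table.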
headers - Standard HTTP headers, including various non-printing characters.
text - Parse field that is a conglomeration of various text fields for general search purposes. Given advances in Solr I suspect this is no longer really needed except possibly for performance reasons.
example (truncated):
Creative Commons Unique Search Tool Now Integrated into Firefox 1.0 - Creative Commons Skip
Navigation Home Creative Commons Menu About Licenses Public Domain Support CC Projects News
About About CC History Who Uses CC? Case Studies Videos about CC The Team Board of
status - Fetch field that records the fetch state of the url as one of the following numeric codes:
| Code | Status |
|------|--------|
| 1 | unfetched (links not yet fetched due to limits set in regex-urlfilter.txt, -topN crawl parameters, etc.) |
| 2 | fetched (page was successfully fetched) |
| 3 | gone (that page no longer exists) |
| 4 | redir_temp (temporary redirection; see reprUrl below for more details) |
| 5 | redir_perm (permanent redirection; see reprUrl below for more details) |
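When inspecting rows by hand it can help to translate the numeric status back into a name. A small Python sketch covering only the five codes listed above; the class name CrawlStatus and the helper status_name are picked for this example, and any other value is simply reported as unknown.

```python
from enum import IntEnum


class CrawlStatus(IntEnum):
    """The status codes listed in the table above."""
    UNFETCHED = 1
    FETCHED = 2
    GONE = 3
    REDIR_TEMP = 4
    REDIR_PERM = 5


def status_name(code: int) -> str:
    """Return the lower-case status name, or unknown(<code>) for other values."""
    try:
        return CrawlStatus(code).name.lower()
    except ValueError:
        return f"unknown({code})"


if __name__ == "__main__":
    for code in (1, 2, 5, 99):
        print(code, status_name(code))
```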
markers - Contains the inject, generate, fetch and parse marks, with the batchId used as the value of each marker.* See Nutch2Crawling.
parseStatus - Parse field, normally null until parsing is attempted. For the list of codes see ParseStatusCodes.html.
example (3 bytes):
02 00 00
modifiedTime - Fetch field. Supposed to be the last time the signature changed (there may be a defect that resets it to 0 after multiple crawls).
score - DbUpdate field ranking a given url/page's importance. Higher is better. See NewScoring
typ - Fetch field containing the MIME type (Internet media type) of the document, such as text/html or application/pdf. Note that some MIME types are excluded by default; this can be changed in conf/regex-urlfilter.txt.
baseUrl - Fetch field. The base url for relative links contained in the content. May be different from the url if the request was redirected.
content - Fetch field. The content of the url.
example (truncated):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.9">
<meta name="Forrest-skin-name" content="nutch">
<title>Welcome to Apache Nutch™</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
title - Parse field. The text in the title tags of the HTML head.
example:
Welcome to Apache Nutch™
reprUrl - Fetch field for representative urls, used for redirects. By default the fetcher does not follow redirected urls immediately; instead it records them for fetching during the next round. The documentation indicates that this can be changed to follow redirects immediately by copying the http.redirect.max property from conf/nutch-default.xml to conf/nutch-site.xml and setting it to a value greater than 0. However, this is not yet implemented in Nutch 2.0: every redirect is handled during the next fetch regardless of the value of http.redirect.max.*
fetchInterval - Fetch field containing the interval until the next fetch, in seconds (defaults to 30 days). See the explanation of the default schedule under fetchTime. The interval can be set per url when injecting, which is why the field is needed (see nutch_inject).
prevFetchTime - Fetch field. The previous value of fetchTime, or null if not available. This is the previous Nutch fetch time, not to be confused with modifiedTime, which is the time the content was actually modified. See the explanation of the default schedule under fetchTime.
inlinks - DbUpdate field with inbound links, useful for Linkrank. See Webgraph at NewScoring.
example:
xhttp://blog.foofactory.fi/2007/03/twice-speed-half-size.html Website up
prevSignature - Parse field -- previous signature. For more details see signature further down.
example (16 bytes):
25 59 5c 73 03 09 bb ed a0 98 5e b6 5e 0c 89 63
outlinks - DbUpdate field - outbound links
fetchTime - Fetch field used by the Mapper to decide whether it is time to fetch this url. See how-to-re-crawl-with-nutch for a well-written overview, and AbstractFetchSchedule in the Nutch API documentation. The default re-fetch schedule is somewhat simplistic: whether or not the page changed, fetchInterval remains unchanged and the updated fetchTime is always set to fetchTime + fetchInterval * 1000 (fetchInterval is in seconds, hence the factor of 1000). See DefaultFetchSchedule. A better implementation for most cases is AdaptiveFetchSchedule. The FetchSchedule implementation can be changed by copying the db.fetch.schedule.class property from conf/nutch-default.xml to conf/nutch-site.xml and changing the value.
retriesSinceFetch - Fetch field counting the number of fetch retries due to (hopefully transient) errors since the last success. See AbstractFetchSchedule.
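The arithmetic of the default schedule is easy to check by hand. The following Python sketch only restates the formula fetchTime + fetchInterval * 1000 with the 30 day default; it is not the DefaultFetchSchedule code, the function names are made up for the example, and the "due" check assumes that a url becomes eligible once its fetchTime is no longer in the future.

```python
DEFAULT_FETCH_INTERVAL_S = 30 * 24 * 60 * 60  # 30 days, in seconds


def next_fetch_time_ms(fetch_time_ms: int,
                       fetch_interval_s: int = DEFAULT_FETCH_INTERVAL_S) -> int:
    """fetchTime + fetchInterval * 1000 (interval in seconds, times in ms)."""
    return fetch_time_ms + fetch_interval_s * 1000


def is_due(fetch_time_ms: int, now_ms: int) -> bool:
    """Assumption for this sketch: a url is due once fetchTime has passed."""
    return fetch_time_ms <= now_ms


if __name__ == "__main__":
    fetched_at = 1_344_729_600_000  # an arbitrary timestamp in milliseconds
    upcoming = next_fetch_time_ms(fetched_at)
    print(upcoming - fetched_at)         # 2592000000 ms, i.e. 30 days
    print(is_due(upcoming, fetched_at))  # False: not due until 30 days later
```

Note that the interval never adapts here, which is exactly the limitation that AdaptiveFetchSchedule is meant to address.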
protocolStatus - Fetch field - see ProtocolStatusCodes
example (3 bytes):
02 00 00
signature - Parse field containing a signature that is calculated every time a page is fetched, so that on the next fetch Nutch can tell whether the page has changed. The default signature implementation uses both the content and the headers as input. For various reasons (etags, etc.) the headers can change without the actual content changing, which makes the default implementation less than optimal for most requirements. For those looking to save some bandwidth on a current-state crawl, or those implementing archival crawling (which requires more changes than just this), the TextProfileSignature implementation is more appropriate. The signature implementation can be changed by copying the db.signature.class property from conf/nutch-default.xml to conf/nutch-site.xml and changing the value to org.apache.nutch.crawl.TextProfileSignature.
example (16 bytes):
e1 f7 cc cc 49 7a 45 6b e7 fc 05 68 9a e8 ea 93
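The idea behind signature and prevSignature can be sketched in a few lines of Python. This assumes an MD5 digest, which matches the 16 byte examples shown here, but the exact bytes Nutch feeds into the hash may differ; the function names are made up for the example.

```python
import hashlib
from typing import Optional


def md5_signature(data: bytes) -> bytes:
    """Return a 16 byte MD5 digest of the given bytes (assumed hash function)."""
    return hashlib.md5(data).digest()


def content_changed(prev_signature: Optional[bytes], signature: bytes) -> bool:
    """A page counts as changed when its signature differs from the previous one."""
    return prev_signature != signature


def format_signature(signature: bytes) -> str:
    """Render a signature as space-separated hex bytes, like the examples here."""
    return " ".join(f"{b:02x}" for b in signature)


if __name__ == "__main__":
    old = md5_signature(b"<html>version 1</html>")
    new = md5_signature(b"<html>version 2</html>")
    print(format_signature(new))
    print(content_changed(old, new))  # True: the bytes differ
```

Hashing a text profile instead of the raw bytes, as TextProfileSignature does, makes the comparison less sensitive to changes that do not affect the visible text.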
metadata - A mixed catch-all field for metadata (see metadata-package-summary.html for more information). The IndexMetatags plugin does not work with Nutch 2.0 but may work with patches or more recent versions.
*Thanks to Ferdy Galema for comments on the marker and redirect.