Gem: hpricot

master
Alinson S. Xavier 18 years ago
parent 2bae9db309
commit e9adbbd345

@ -0,0 +1,62 @@
= 0.6
=== 15th June, 2007
* Hpricot for JRuby -- nice work Ola Bini!
* Inline Markaby for Hpricot documents.
* XML tags and attributes are no longer downcased like HTML is.
* new syntax for grabbing everything between two elements using a Range in the search method: (doc/("font".."font/br")) or in nodes_at like so: (doc/"font").nodes_at("*".."br"). Only works with either a pair of siblings or a set of a parent and a sibling.
* Ignore self-closing endings on tags (such as form) which are containers. Treat them like open parent tags. Reported by Jonathan Nichols on the hpricot list.
* Escaping of attributes, yanked from Jim Weirich and Sam Ruby's work in Builder.
* Element#raw_attributes gives unescaped data. Element#attributes gives escaped.
* Added: Elements#attr, Elements#remove_attr, Elements#remove_class.
* Added: Traverse#preceding, Traverse#following, Traverse#previous, Traverse#next.
= 0.5
=== 31st January, 2007
* support for a[text()="Click Me!"] and h3[text()*="space"] and the like.
* Hpricot.buffer_size accessor for increasing Hpricot's buffer if you're encountering huge ASP.NET viewstate attribs.
* some support for colons in tag names (not full namespace support yet.)
* Element.to_original_html will attempt to preserve the original HTML while merging your changes.
* Element.to_plain_text converts an element's contents to a simple text format.
* Element.inner_text removes all tags and returns text nodes concatenated into a single string.
* no @raw_string variable kept for comments, text, and cdata -- as it's redundant.
* xpath-style indices (//p/a[1]) but keep in mind that they aren't zero-based.
* node_position is the index among all sibling nodes, while position is the position among children of identical type.
* comment() and text() search criteria, like: //p/text(), which selects all text inside paragraph tags.
* every element has css_path and xpath methods which return respective absolute paths.
* more flexibility all around: in parsing attributes, tags, comments and cdata.
= 0.4
=== 11th August, 2006
* The :fixup_tags option will try to sort out the hierarchy so elements end up with the right parents.
* Elements such as *script* and *style* (identified as having CDATA contents) receive a single text node as their children now. Previously, Hpricot was parsing out tags found in scripts.
* Better scanning of partially quoted attributes (found by Brent Beardsly on http://uswebgen.com/)
* Better scanning of unquoted attributes -- thanks to Aaron Patterson for the test cases!
* Some tags were being output in the empty tag style, although browsers hated that. FIXED!
* Added Elements#at for finding single elements.
* Added Elem::Trav#[] and Elem::Trav#[]= for reading and writing attributes.
= 0.3
=== 7th July, 2006
* Fixed negative string size error on empty tokens. (news.bbc.co.uk)
* Allow the parser to accept just text nodes. (such as: <tt>Hpricot.parse('TEXT')</tt>)
* from JQuery to Hpricot::Elements: remove, empty, append, prepend, before, after, wrap, set,
html(...), to_html, to_s.
* on containers: to_html, replace_child, insert_before, insert_after, innerHTML=.
* Hpricot(...) is an alias for parse.
* open up all properties to setters, let people do as they may.
* use to_html for the full html of a node or set of elements.
* doctypes were messed.
= 0.2
=== 4th July, 2006
* Rewrote the HTree parser to be simpler, more adequate for the common man. Will add encoding back in later.
= 0.1
=== 3rd July, 2006
* For whatever reason, wrote this HTML parser in C.
I guess Ragel is addictive and I want to improve HTree.

@ -0,0 +1,18 @@
Copyright (c) 2006 why the lucky stiff
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to
deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

@ -0,0 +1,284 @@
= Hpricot, Read Any HTML
Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
accommodating (like Tanaka Akira's HTree) and to have a very helpful library
(like some JavaScript libs -- JQuery, Prototype -- give you.) The XPath and CSS
parser, in fact, is based on John Resig's JQuery.
Also, Hpricot can be handy for reading broken XML files, since many of the same
techniques can be used. If a quote is missing, Hpricot tries to figure it out.
If tags overlap, Hpricot works on sorting them out. You know, that sort of
thing.
*Please read this entire document* before making assumptions about how this
software works.
== An Overview
Let's clear up what Hpricot is.
# Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
# While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
pays a small penalty in order to get that right. So that's slightly more important
to me than speed.
# *If you can see it in Firefox, then Hpricot should parse it.* That's
how it should be! Let me know the minute it's otherwise.
# Primarily, Hpricot is used for reading HTML and tries to sort out troubled
HTML by having some idea of what good HTML is. Some people still like to use
Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!
== The Hpricot Kingdom
First, here are all the links you need to know:
* http://code.whytheluckystiff.net/hpricot is the Hpricot wiki and bug tracker.
Go there for news and recipes and patches. It's the center of activity.
* http://code.whytheluckystiff.net/svn/hpricot/trunk is the main Subversion
repository for Hpricot. You can get the latest code there.
* http://code.whytheluckystiff.net/doc/hpricot is the home for the latest copy of
this reference.
* See COPYING for the terms of this software. (Spoiler: it's absolutely free.)
If you have any trouble, don't hesitate to contact the author. As always, I'm
not going to say "Use at your own risk" because I don't want this library to be
risky. If you trip on something, I'll share the liability by repairing things
as quickly as I can. Your responsibility is to report the inadequacies.
== Installing Hpricot
You may get the latest stable version from Rubyforge. Win32 binaries and source
gems are available.
  $ gem install hpricot
As Hpricot is still under active development, you can also try the most recent
candidate build here:
  $ gem install hpricot --source http://code.whytheluckystiff.net
The development gem is usually in pretty good shape actually. You can also
get the bleeding edge code or plain Ruby tarballs on the wiki.
== An Hpricot Showcase
We're going to run through a big pile of examples to get you jump-started.
Many of these examples are also found at
http://code.whytheluckystiff.net/hpricot/wiki/HpricotBasics, in case you
want to add some of your own.
=== Loading Hpricot Itself
You have probably got the gem, right? To load Hpricot:
  require 'rubygems'
  require 'hpricot'
If you've installed the plain source distribution, go ahead and just:
  require 'hpricot'
=== Load an HTML Page
The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
contents into a document object.
  doc = Hpricot("<p>A simple <b>test</b> string.</p>")
To load from a file, just get the stream open:
  doc = open("index.html") { |f| Hpricot(f) }
To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:
  require 'open-uri'
  doc = open("http://qwantz.com/") { |f| Hpricot(f) }
Hpricot uses an internal buffer to parse the file, so the IO will stream
properly and large documents won't be loaded into memory all at once. However,
the parsed document object will be present in memory, in its entirety.
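To picture what streaming through a buffer means, here is a minimal pure-Ruby sketch. It is not Hpricot's scanner (that work happens inside the C extension). It only shows the general pattern of pulling fixed-size chunks from anything that responds to <tt>read</tt>. The names <tt>each_chunk</tt> and <tt>CHUNK_SIZE</tt> are made up for this illustration, and the 16384 figure simply mirrors the <tt>BUFSIZE</tt> constant in hpricot_scan.h.

```ruby
require 'stringio'

# Hypothetical chunk size; mirrors BUFSIZE in ext/hpricot_scan/hpricot_scan.h.
CHUNK_SIZE = 16384

# Yield successive chunks from an IO-like object until it is exhausted.
# Illustrative only: Hpricot does its buffering inside the C scanner.
def each_chunk(io, size = CHUNK_SIZE)
  while (chunk = io.read(size))
    break if chunk.empty?
    yield chunk
  end
end

html = "<p>" + ("x" * 40_000) + "</p>"
sizes = []
each_chunk(StringIO.new(html)) { |chunk| sizes << chunk.length }
puts sizes.inspect
```

No chunk is ever larger than <tt>CHUNK_SIZE</tt>, so memory used while reading stays bounded however big the input is. In Hpricot itself, the <tt>Hpricot.buffer_size</tt> accessor (added in 0.5) is the knob for the real buffer.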
=== Search for Elements
Use <tt>Doc.search</tt>:
  doc.search("//p[@class='posted']")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>
<tt>Doc.search</tt> can take an XPath or CSS expression. The example above
grabs every paragraph (<tt><p></tt>) element that has a <tt>class</tt>
attribute of <tt>"posted"</tt>.
A shortcut is to use the division operator (<tt>/</tt>):
  (doc/"p.posted")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>
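If the <tt>/</tt> syntax looks like magic, it is plain Ruby operator overloading. The toy class here is an assumption made up for illustration (<tt>ToyDoc</tt> and its <tt>trail</tt> are not part of Hpricot). It only shows why <tt>(doc/"p.posted")</tt> can stand in for <tt>doc.search("p.posted")</tt>, and why such calls can be chained.

```ruby
# Toy stand-in for a document. NOT Hpricot's implementation.
class ToyDoc
  attr_reader :trail

  def initialize(trail = [])
    @trail = trail
  end

  # A real library would run the query here; this toy only records it.
  def search(expr)
    ToyDoc.new(trail + [expr.to_s])
  end

  # Defining the / operator lets callers write (doc/"p") for doc.search("p").
  def /(expr)
    search(expr)
  end
end

doc = ToyDoc.new
result = doc/"p.posted"
puts result.trail.inspect
```

Because <tt>/</tt> returns another searchable object, the result of one search can be divided again, and symbols work as well as strings since the toy calls <tt>to_s</tt> on whatever it is given.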
=== Finding Just One Element
If you're looking for a single element, the <tt>at</tt> method will return the
first element matched by the expression. In this case, you'll get back the
element itself rather than the <tt>Hpricot::Elements</tt> array.
  doc.at("body")['onload']
The above code will find the body tag and give you back the <tt>onload</tt>
attribute. This is the most common reason to use the element directly: when
reading and writing HTML attributes.
=== Fetching the Contents of an Element
Just as with browser scripting, the <tt>inner_html</tt> property can be used to
get the inner contents of an element.
  (doc/"#elementID").inner_html
  #=> "..<b>contents</b>.."
If your expression matches more than one element, you'll get back the contents
of <em>all the matched elements</em>. So you may want to use <tt>first</tt> to be
sure you get back only one.
  (doc/"#elementID").first.inner_html
  #=> "..<b>contents</b>.."
=== Fetching the HTML for an Element
If you want the HTML for the whole element (not just the contents), use
<tt>to_html</tt>:
  (doc/"#elementID").to_html
  #=> "<div id='elementID'>...</div>"
=== Looping
All searches return a set of <tt>Hpricot::Elements</tt>. Go ahead and loop
through them like you would an array.
  (doc/"p/a/img").each do |img|
    puts img.attributes['class']
  end
=== Continuing Searches
Searches can be continued from a collection of elements, in order to search deeper.
  # find all paragraphs.
  elements = doc.search("/html/body//p")
  # continue the search by finding any images within those paragraphs.
  (elements/"img")
  #=> #<Hpricot::Elements[{img ...}, {img ...}]>
Searches can also be continued by searching within container elements.
  # find all images within paragraphs.
  doc.search("/html/body//p").each do |para|
    puts "== Found a paragraph =="
    pp para
    imgs = para.search("img")
    if imgs.any?
      puts "== Found #{imgs.length} images inside =="
    end
  end
Of course, the most succinct ways to do the above are using CSS or XPath.
  # the xpath version
  (doc/"/html/body//p//img")
  # the css version
  (doc/"html > body > p img")
  # ..or symbols work, too!
  (doc/:html/:body/:p/:img)
=== Looping Edits
You may certainly edit objects from within your search loops. Then, when you
spit out the HTML, the altered elements will show.
  (doc/"span.entryPermalink").each do |span|
    span.attributes['class'] = 'newLinks'
  end
  puts doc
This changes all <tt>span.entryPermalink</tt> elements to
<tt>span.newLinks</tt>. Keep in mind that there are often more convenient ways
of doing this, such as the <tt>set</tt> method:
  (doc/"span.entryPermalink").set(:class => 'newLinks')
=== Figuring Out Paths
Every element can tell you its unique path (either XPath or CSS) to get to the
element from the root tag.
The <tt>css_path</tt> method:
  doc.at("div > div:nth(1)").css_path
  #=> "div > div:nth(1)"
  doc.at("#header").css_path
  #=> "#header"
Or, the <tt>xpath</tt> method:
  doc.at("div > div:nth(1)").xpath
  #=> "/div/div:eq(1)"
  doc.at("#header").xpath
  #=> "//div[@id='header']"
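For a feel of how an absolute path can be worked out, here is a minimal sketch that walks up parent links. Everything in it is an assumption for illustration: <tt>Node</tt> and <tt>absolute_path</tt> are hypothetical names, not Hpricot classes or methods, and the output format is this sketch's own rather than what <tt>xpath</tt> returns. It does follow the changelog's notes that indices are not zero-based and that position counts only children of identical type.

```ruby
# Hypothetical stand-in for a parsed element; not an Hpricot class.
Node = Struct.new(:name, :parent, :children) do
  # Append a new child with the given tag name and return it.
  def add(name)
    child = Node.new(name, self, [])
    children << child
    child
  end
end

# Build an absolute XPath-like string by walking up the parent links.
# Indices are 1-based and count only siblings with the same tag name.
def absolute_path(node)
  steps = []
  while node.parent
    same = node.parent.children.select { |c| c.name == node.name }
    idx = same.index { |c| c.equal?(node) } + 1
    steps.unshift(same.length > 1 ? "#{node.name}[#{idx}]" : node.name)
    node = node.parent
  end
  "/" + ([node.name] + steps).join("/")
end

html = Node.new("html", nil, [])
body = html.add("body")
body.add("p")
second = body.add("p")
link = second.add("a")
puts absolute_path(link)   # prints "/html/body/p[2]/a"
```

An index is only written when a tag has same-named siblings, which keeps the path short while still pointing at exactly one node.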
== Hpricot Fixups
When loading HTML documents, you have a few settings that can make Hpricot more
or less intense about how it gets involved.
=== :fixup_tags
Really, there are so many ways to clean up HTML and your intentions may be to
keep the HTML as-is. So Hpricot's default behavior is to keep things flexible:
it makes sure to open and close all the tags, but ignores any validation problems.
As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
to shift the document's tags to meet XHTML 1.0 Strict.
  doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow
the rules a bit better. For example, if Hpricot finds a paragraph in a link, it's
going to move the paragraph below the link, or up and out of other elements
where paragraphs don't belong.
If an unknown element is found, it is ignored. Again, <tt>:fixup_tags</tt>.
=== :xhtml_strict
So, let's go beyond just trying to fix the hierarchy. The
<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
1.0 Strict document. Even at the cost of removing elements that get in the way.
  doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
What measures does <tt>:xhtml_strict</tt> take?
1. Shift elements into their proper containers just like :fixup_tags.
2. Remove unknown elements.
3. Remove unknown attributes.
4. Remove illegal content.
5. Alter the doctype to XHTML 1.0 Strict.
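As a rough picture of the third measure (removing unknown attributes), here is a minimal sketch. It is not Hpricot's code. <tt>KNOWN_ATTRIBUTES</tt> and <tt>strip_unknown_attributes</tt> are hypothetical names, and the allow-list is a tiny hand-picked subset standing in for the real rules in the XHTML 1.0 Strict DTD.

```ruby
# Hypothetical allow-list. A tiny subset for illustration only; the real
# rules come from the XHTML 1.0 Strict DTD.
KNOWN_ATTRIBUTES = {
  "a"   => %w[href title class id],
  "img" => %w[src alt class id]
}.freeze

# Return a copy of attrs with anything not listed for the tag removed.
# Tags missing from the list keep no attributes at all in this toy.
def strip_unknown_attributes(tag, attrs)
  allowed = KNOWN_ATTRIBUTES.fetch(tag, [])
  attrs.select { |key, _value| allowed.include?(key) }
end

attrs = { "href" => "/archive", "target" => "_blank" }
puts strip_unknown_attributes("a", attrs).inspect
```

The <tt>target</tt> attribute makes a handy test case because XHTML 1.0 Strict does not allow it on links, so a strict cleanup would be expected to drop it while keeping <tt>href</tt>.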
=== Hpricot.XML()
The last option is the <tt>:xml</tt> option, which makes some slight variations
on the standard mode. The main difference is that :xml mode won't try to output
tags which are friendlier for browsers. For example, if an opening and closing
<tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.
XML mode also doesn't downcase the tags and attributes for you. So pay attention
to case, friends.
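The case rule is small enough to sketch in a few lines. The helper here is hypothetical (<tt>normalize_name</tt> is not an Hpricot method); it only restates the behavior described above, where HTML mode folds names to lowercase and XML mode leaves them alone.

```ruby
# Hypothetical helper: how a parser might normalize a tag or attribute name.
# HTML mode folds names to lowercase; XML mode leaves them untouched.
def normalize_name(name, xml: false)
  xml ? name : name.downcase
end

puts normalize_name("DIV")                  # prints "div"
puts normalize_name("pubDate", xml: true)   # prints "pubDate"
```

The practical upshot is that a search for <tt>pubdate</tt> would be expected to miss a <tt>pubDate</tt> element in XML mode, so write your expressions with the document's own capitalization.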
The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:
  doc = open("http://redhanded.hobix.com/index.xml") do |f|
    Hpricot.XML(f)
  end
*Also, :fixup_tags is canceled out by the :xml option.* This is because
:fixup_tags makes assumptions based on how HTML is structured. Specifically, how
tags are defined in the XHTML 1.0 DTD.

@ -0,0 +1,211 @@
require 'rake'
require 'rake/clean'
require 'rake/gempackagetask'
require 'rake/rdoctask'
require 'rake/testtask'
require 'fileutils'
include FileUtils
NAME = "hpricot"
REV = `svn info`[/Revision: (\d+)/, 1] rescue nil
VERS = ENV['VERSION'] || "0.6" + (REV ? ".#{REV}" : "")
PKG = "#{NAME}-#{VERS}"
BIN = "*.{bundle,jar,so,obj,pdb,lib,def,exp}"
ARCHLIB = "lib/#{::Config::CONFIG['arch']}"
CLEAN.include ["ext/hpricot_scan/#{BIN}", "lib/**/#{BIN}", 'ext/hpricot_scan/Makefile',
'**/.*.sw?', '*.gem', '.config']
RDOC_OPTS = ['--quiet', '--title', 'The Hpricot Reference', '--main', 'README', '--inline-source']
PKG_FILES = %w(CHANGELOG COPYING README Rakefile) +
Dir.glob("{bin,doc,test,lib,extras}/**/*") +
Dir.glob("ext/**/*.{h,java,c,rb,rl}") +
%w[ext/hpricot_scan/hpricot_scan.c] # needed because it's generated later
SPEC =
Gem::Specification.new do |s|
s.name = NAME
s.version = VERS
s.platform = Gem::Platform::RUBY
s.has_rdoc = true
s.rdoc_options += RDOC_OPTS
s.extra_rdoc_files = ["README", "CHANGELOG", "COPYING"]
s.summary = "a swift, liberal HTML parser with a fantastic library"
s.description = s.summary
s.author = "why the lucky stiff"
s.email = 'why@ruby-lang.org'
s.homepage = 'http://code.whytheluckystiff.net/hpricot/'
s.files = PKG_FILES
s.require_paths = [ARCHLIB, "lib"]
s.extensions = FileList["ext/**/extconf.rb"].to_a
s.bindir = "bin"
end
desc "Does a full compile, test run"
task :default => [:compile, :test]
desc "Packages up Hpricot."
task :package => [:clean, :ragel]
desc "Releases packages for all Hpricot packages and platforms."
task :release => [:package, :package_win32, :package_jruby]
desc "Run all the tests"
Rake::TestTask.new do |t|
t.libs << "test" << ARCHLIB
t.test_files = FileList['test/test_*.rb']
t.verbose = true
end
Rake::RDocTask.new do |rdoc|
rdoc.rdoc_dir = 'doc/rdoc'
rdoc.options += RDOC_OPTS
rdoc.main = "README"
rdoc.rdoc_files.add ['README', 'CHANGELOG', 'COPYING', 'lib/**/*.rb']
end
Rake::GemPackageTask.new(SPEC) do |p|
p.need_tar = true
p.gem_spec = SPEC
end
extension = "hpricot_scan"
ext = "ext/hpricot_scan"
ext_so = "#{ext}/#{extension}.#{Config::CONFIG['DLEXT']}"
ext_files = FileList[
"#{ext}/*.c",
"#{ext}/*.h",
"#{ext}/*.rl",
"#{ext}/extconf.rb",
"#{ext}/Makefile",
"lib"
]
task "lib" do
directory "lib"
end
desc "Compiles the Ruby extension"
task :compile => [:hpricot_scan] do
if Dir.glob(File.join(ARCHLIB,"hpricot_scan.*")).length == 0
STDERR.puts "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
STDERR.puts "Gem actually failed to build. Your system is"
STDERR.puts "NOT configured properly to build hpricot."
STDERR.puts "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
exit(1)
end
end
task :hpricot_scan => [:ragel]
desc "Builds just the #{extension} extension"
task extension.to_sym => ["#{ext}/Makefile", ext_so ]
file "#{ext}/Makefile" => ["#{ext}/extconf.rb"] do
Dir.chdir(ext) do ruby "extconf.rb" end
end
file ext_so => ext_files do
Dir.chdir(ext) do
sh(PLATFORM =~ /win32/ ? 'nmake' : 'make')
end
mkdir_p ARCHLIB
cp ext_so, ARCHLIB
end
desc "returns the ragel version"
task :ragel_version do
@ragel_v = `ragel -v`[/(version )(\S*)/,2].to_f
end
desc "Generates the C scanner code with Ragel."
task :ragel => [:ragel_version] do
sh %{ragel ext/hpricot_scan/hpricot_scan.rl | #{@ragel_v >= 5.18 ? 'rlgen-cd' : 'rlcodegen'} -G2 -o ext/hpricot_scan/hpricot_scan.c}
end
desc "Generates the Java scanner code with Ragel."
task :ragel_java => [:ragel_version] do
sh %{ragel -J ext/hpricot_scan/hpricot_scan.java.rl | #{@ragel_v >= 5.18 ? 'rlgen-java' : 'rlcodegen'} -o ext/hpricot_scan/HpricotScanService.java}
end
### Win32 Packages ###
Win32Spec = SPEC.dup
Win32Spec.platform = Gem::Platform::WIN32
Win32Spec.files = PKG_FILES + ["#{ARCHLIB}/hpricot_scan.so"]
Win32Spec.extensions = []
WIN32_PKG_DIR = "#{PKG}-mswin32"
desc "Package up the Win32 distribution."
file WIN32_PKG_DIR => [:package] do
sh "tar zxf pkg/#{PKG}.tgz"
mv PKG, WIN32_PKG_DIR
end
desc "Cross-compile the hpricot_scan extension for win32"
file "hpricot_scan_win32" => [WIN32_PKG_DIR] do
cp "extras/mingw-rbconfig.rb", "#{WIN32_PKG_DIR}/ext/hpricot_scan/rbconfig.rb"
sh "cd #{WIN32_PKG_DIR}/ext/hpricot_scan/ && ruby -I. extconf.rb && make"
mv "#{WIN32_PKG_DIR}/ext/hpricot_scan/hpricot_scan.so", "#{WIN32_PKG_DIR}/#{ARCHLIB}"
end
desc "Build the binary RubyGems package for win32"
task :package_win32 => ["hpricot_scan_win32"] do
Dir.chdir("#{WIN32_PKG_DIR}") do
Gem::Builder.new(Win32Spec).build
verbose(true) {
mv Dir["*.gem"].first, "../pkg/#{WIN32_PKG_DIR}.gem"
}
end
end
CLEAN.include WIN32_PKG_DIR
### JRuby Packages ###
compile_java = proc do
sh %{javac -source 1.4 -target 1.4 -classpath $JRUBY_HOME/lib/jruby.jar HpricotScanService.java}
sh %{jar cf hpricot_scan.jar HpricotScanService.class}
end
desc "Compiles the JRuby extension"
task :hpricot_scan_java => [:ragel_java] do
Dir.chdir("ext/hpricot_scan", &compile_java)
end
JRubySpec = SPEC.dup
JRubySpec.platform = 'jruby'
JRubySpec.files = PKG_FILES + ["#{ARCHLIB}/hpricot_scan.jar"]
JRubySpec.extensions = []
JRUBY_PKG_DIR = "#{PKG}-jruby"
desc "Package up the JRuby distribution."
file JRUBY_PKG_DIR => [:ragel_java, :package] do
sh "tar zxf pkg/#{PKG}.tgz"
mv PKG, JRUBY_PKG_DIR
end
desc "Cross-compile the hpricot_scan extension for JRuby"
file "hpricot_scan_jruby" => [JRUBY_PKG_DIR] do
Dir.chdir("#{JRUBY_PKG_DIR}/ext/hpricot_scan", &compile_java)
mv "#{JRUBY_PKG_DIR}/ext/hpricot_scan/hpricot_scan.jar", "#{JRUBY_PKG_DIR}/#{ARCHLIB}"
end
desc "Build the RubyGems package for JRuby"
task :package_jruby => ["hpricot_scan_jruby"] do
Dir.chdir("#{JRUBY_PKG_DIR}") do
Gem::Builder.new(JRubySpec).build
verbose(true) {
mv Dir["*.gem"].first, "../pkg/#{JRUBY_PKG_DIR}.gem"
}
end
end
CLEAN.include JRUBY_PKG_DIR
task :install do
sh %{rake package}
sh %{sudo gem install pkg/#{NAME}-#{VERS}}
end
task :uninstall => [:clean] do
sh %{sudo gem uninstall #{NAME}}
end

File diff suppressed because it is too large

@ -0,0 +1,6 @@
require 'mkmf'
dir_config("hpricot_scan")
have_library("c", "main")
create_makefile("hpricot_scan")

@ -0,0 +1,76 @@
%%{
machine hpricot_common;
#
# HTML tokens
# (a blatant rip from HTree)
#
newline = '\n' @{curline += 1;} ;
NameChar = [\-A-Za-z0-9._:?] ;
Name = [A-Za-z_:] NameChar* ;
StartComment = "<!--" ;
EndComment = "-->" ;
StartCdata = "<![CDATA[" ;
EndCdata = "]]>" ;
NameCap = Name >_tag %tag;
NameAttr = NameChar+ >_akey %akey ;
Q1Char = ( "\\\'" | [^'] ) ;
Q1Attr = Q1Char* >_aval %aval ;
Q2Char = ( "\\\"" | [^"] ) ;
Q2Attr = Q2Char* >_aval %aval ;
UnqAttr = ( space >_aval | [^ \t\r\n<>"'] >_aval [^ \t\r\n<>]* %aunq ) ;
Nmtoken = NameChar+ >_akey %akey ;
Attr = NameAttr space* "=" space* ('"' Q2Attr '"' | "'" Q1Attr "'" | UnqAttr space+ ) space* ;
AttrEnd = ( NameAttr space* "=" space* UnqAttr? | Nmtoken >new_attr %save_attr ) ;
AttrSet = ( Attr >new_attr %save_attr | Nmtoken >new_attr space+ %save_attr ) ;
StartTag = "<" NameCap space+ AttrSet* (AttrEnd >new_attr %save_attr)? ">" | "<" NameCap ">";
EmptyTag = "<" NameCap space+ AttrSet* (AttrEnd >new_attr %save_attr)? "/>" | "<" NameCap "/>" ;
EndTag = "</" NameCap space* ">" ;
XmlVersionNum = [a-zA-Z0-9_.:\-]+ >_aval %xmlver ;
XmlVersionInfo = space+ "version" space* "=" space* ("'" XmlVersionNum "'" | '"' XmlVersionNum '"' ) ;
XmlEncName = [A-Za-z] >_aval [A-Za-z0-9._\-]* %xmlenc ;
XmlEncodingDecl = space+ "encoding" space* "=" space* ("'" XmlEncName "'" | '"' XmlEncName '"' ) ;
XmlYesNo = ("yes" | "no") >_aval %xmlsd ;
XmlSDDecl = space+ "standalone" space* "=" space* ("'" XmlYesNo "'" | '"' XmlYesNo '"') ;
XmlDecl = "<?xml" XmlVersionInfo XmlEncodingDecl? XmlSDDecl? space* "?"? ">" ;
SystemLiteral = '"' [^"]* >_aval %sysid '"' | "'" [^']* >_aval %sysid "'" ;
PubidLiteral = '"' [\t a-zA-Z0-9\-'()+,./:=?;!*\#@$_%]* >_aval %pubid '"' |
"'" [\t a-zA-Z0-9\-'()+,./:=?;!*\#@$_%]* >_aval %pubid "'" ;
ExternalID = ( "SYSTEM" | "PUBLIC" space+ PubidLiteral ) (space+ SystemLiteral)? ;
DocType = "<!DOCTYPE" space+ NameCap (space+ ExternalID)? space* ("[" [^\]]* "]" space*)? ">" ;
StartXmlProcIns = "<?" Name >{ TEXT_PASS(); } space+ ;
EndXmlProcIns = "?"? ">" ;
html_comment := |*
EndComment @{ EBLK(comment, 3); fgoto main; };
any | newline { TEXT_PASS(); };
*|;
html_cdata := |*
EndCdata @{ EBLK(cdata, 3); fgoto main; };
any | newline { TEXT_PASS(); };
*|;
html_procins := |*
EndXmlProcIns @{ EBLK(procins, 2); fgoto main; };
any | newline { TEXT_PASS(); };
*|;
main := |*
XmlDecl >newEle { ELE(xmldecl); };
DocType >newEle { ELE(doctype); };
StartXmlProcIns >newEle { fgoto html_procins; };
StartTag >newEle { ELE(stag); };
EndTag >newEle { ELE(etag); };
EmptyTag >newEle { ELE(emptytag); };
StartComment >newEle { fgoto html_comment; };
StartCdata >newEle { fgoto html_cdata; };
any | newline { TEXT_PASS(); };
*|;
}%%;

File diff suppressed because it is too large

@ -0,0 +1,79 @@
/*
* hpricot_scan.h
*
* $Author: why $
* $Date: 2006-05-08 22:03:50 -0600 (Mon, 08 May 2006) $
*
* Copyright (C) 2006 why the lucky stiff
* You can redistribute it and/or modify it under the same terms as Ruby.
*/
#ifndef hpricot_scan_h
#define hpricot_scan_h
#include <sys/types.h>
#if defined(_WIN32)
#include <stddef.h>
#endif
/*
* Memory Allocation
*/
#if defined(HAVE_ALLOCA_H) && !defined(__GNUC__)
#include <alloca.h>
#endif
#ifndef NULL
# define NULL ((void *)0)
#endif
#define BUFSIZE 16384
#define S_ALLOC_N(type,n) (type*)malloc(sizeof(type)*(n))
#define S_ALLOC(type) (type*)malloc(sizeof(type))
#define S_REALLOC_N(var,type,n) (var)=(type*)realloc((char*)(var),sizeof(type)*(n))
#define S_FREE(n) free(n); n = NULL;
#define S_ALLOCA_N(type,n) (type*)alloca(sizeof(type)*(n))
#define S_MEMZERO(p,type,n) memset((p), 0, sizeof(type)*(n))
#define S_MEMCPY(p1,p2,type,n) memcpy((p1), (p2), sizeof(type)*(n))
#define S_MEMMOVE(p1,p2,type,n) memmove((p1), (p2), sizeof(type)*(n))
#define S_MEMCMP(p1,p2,type,n) memcmp((p1), (p2), sizeof(type)*(n))
typedef struct {
void *name;
void *attributes;
} hpricot_element;
typedef void (*hpricot_element_cb)(void *data, hpricot_element *token);
typedef struct hpricot_scan {
int lineno;
int cs;
size_t nread;
size_t mark;
void *data;
hpricot_element_cb xmldecl;
hpricot_element_cb doctype;
hpricot_element_cb xmlprocins;
hpricot_element_cb starttag;
hpricot_element_cb endtag;
hpricot_element_cb emptytag;
hpricot_element_cb comment;
hpricot_element_cb cdata;
} hpricot_scan;
// int hpricot_scan_init(hpricot_scan *scan);
// int hpricot_scan_finish(hpricot_scan *scan);
// size_t hpricot_scan_execute(hpricot_scan *scan, const char *data, size_t len, size_t off);
// int hpricot_scan_has_error(hpricot_scan *scan);
// int hpricot_scan_is_finished(hpricot_scan *scan);
//
// #define hpricot_scan_nread(scan) (scan)->nread
#endif

@ -0,0 +1,363 @@
import java.io.IOException;
import org.jruby.Ruby;
import org.jruby.RubyClass;
import org.jruby.RubyHash;
import org.jruby.RubyModule;
import org.jruby.RubyNumeric;
import org.jruby.RubyString;
import org.jruby.runtime.Block;
import org.jruby.runtime.CallbackFactory;
import org.jruby.runtime.builtin.IRubyObject;
import org.jruby.exceptions.RaiseException;
import org.jruby.runtime.load.BasicLibraryService;
public class HpricotScanService implements BasicLibraryService {
public static String NO_WAY_SERIOUSLY="*** This should not happen, please send a bug report with the HTML you're parsing to why@whytheluckystiff.net. So sorry!";
public void ELE(IRubyObject N) {
if (tokend > tokstart || text) {
IRubyObject raw_string = runtime.getNil();
ele_open = false; text = false;
if (tokstart != -1 && N != cdata && N != sym_text && N != procins && N != comment) {
raw_string = runtime.newString(new String(buf,tokstart,tokend-tokstart));
}
rb_yield_tokens(N, tag[0], attr, raw_string, taint);
}
}
public void SET(IRubyObject[] N, int E) {
int mark = 0;
if(N == tag) {
if(mark_tag == -1 || E == mark_tag) {
tag[0] = runtime.newString("");
} else if(E > mark_tag) {
tag[0] = runtime.newString(new String(buf,mark_tag, E-mark_tag));
}
} else if(N == akey) {
if(mark_akey == -1 || E == mark_akey) {
akey[0] = runtime.newString("");
} else if(E > mark_akey) {
akey[0] = runtime.newString(new String(buf,mark_akey, E-mark_akey));
}
} else if(N == aval) {
if(mark_aval == -1 || E == mark_aval) {
aval[0] = runtime.newString("");
} else if(E > mark_aval) {
aval[0] = runtime.newString(new String(buf,mark_aval, E-mark_aval));
}
}
}
public void CAT(IRubyObject[] N, int E) {
if(N[0].isNil()) {
SET(N,E);
} else {
int mark = 0;
if(N == tag) {
mark = mark_tag;
} else if(N == akey) {
mark = mark_akey;
} else if(N == aval) {
mark = mark_aval;
}
((RubyString)(N[0])).append(runtime.newString(new String(buf, mark, E-mark)));
}
}
public void SLIDE(Object N) {
int mark = 0;
if(N == tag) {
mark = mark_tag;
} else if(N == akey) {
mark = mark_akey;
} else if(N == aval) {
mark = mark_aval;
}
if(mark > tokstart) {
if(N == tag) {
mark_tag -= tokstart;
} else if(N == akey) {
mark_akey -= tokstart;
} else if(N == aval) {
mark_aval -= tokstart;
}
}
}
public void ATTR(IRubyObject K, IRubyObject V) {
if(!K.isNil()) {
if(attr.isNil()) {
attr = RubyHash.newHash(runtime);
}
((RubyHash)attr).aset(K,V);
}
}
public void ATTR(IRubyObject[] K, IRubyObject V) {
ATTR(K[0],V);
}
public void ATTR(IRubyObject K, IRubyObject[] V) {
ATTR(K,V[0]);
}
public void ATTR(IRubyObject[] K, IRubyObject[] V) {
ATTR(K[0],V[0]);
}
public void TEXT_PASS() {
if(!text) {
if(ele_open) {
ele_open = false;
if(tokstart > -1) {
mark_tag = tokstart;
}
} else {
mark_tag = p;
}
attr = runtime.getNil();
tag[0] = runtime.getNil();
text = true;
}
}
public void EBLK(IRubyObject N, int T) {
CAT(tag, p - T + 1);
ELE(N);
}
public void rb_raise(RubyClass error, String message) {
throw new RaiseException(runtime, error, message, true);
}
public IRubyObject rb_str_new2(String s) {
return runtime.newString(s);
}
%%{
machine hpricot_scan;
action newEle {
if (text) {
CAT(tag, p);
ELE(sym_text);
text = false;
}
attr = runtime.getNil();
tag[0] = runtime.getNil();
mark_tag = -1;
ele_open = true;
}
action _tag { mark_tag = p; }
action _aval { mark_aval = p; }
action _akey { mark_akey = p; }
action tag { SET(tag, p); }
action tagc { SET(tag, p-1); }
action aval { SET(aval, p); }
action aunq {
if (buf[p-1] == '"' || buf[p-1] == '\'') { SET(aval, p-1); }
else { SET(aval, p); }
}
action akey { SET(akey, p); }
action xmlver { SET(aval, p); ATTR(rb_str_new2("version"), aval); }
action xmlenc { SET(aval, p); ATTR(rb_str_new2("encoding"), aval); }
action xmlsd { SET(aval, p); ATTR(rb_str_new2("standalone"), aval); }
action pubid { SET(aval, p); ATTR(rb_str_new2("public_id"), aval); }
action sysid { SET(aval, p); ATTR(rb_str_new2("system_id"), aval); }
action new_attr {
akey[0] = runtime.getNil();
aval[0] = runtime.getNil();
mark_akey = -1;
mark_aval = -1;
}
action save_attr {
ATTR(akey, aval);
}
include hpricot_common "ext/hpricot_scan/hpricot_common.rl";
}%%
%% write data nofinal;
public final static int BUFSIZE=16384;
private void rb_yield_tokens(IRubyObject sym, IRubyObject tag, IRubyObject attr, IRubyObject raw, boolean taint) {
IRubyObject ary;
if (sym == runtime.newSymbol("text")) {
raw = tag;
}
ary = runtime.newArray(new IRubyObject[]{sym, tag, attr, raw});
if (taint) {
ary.setTaint(true);
tag.setTaint(true);
attr.setTaint(true);
raw.setTaint(true);
}
block.yield(runtime.getCurrentContext(), ary, null, null, false);
}
int cs, act, have = 0, nread = 0, curline = 1, p=-1;
boolean text = false;
int tokstart=-1, tokend;
char[] buf;
Ruby runtime;
IRubyObject attr, bufsize;
IRubyObject[] tag, akey, aval;
int mark_tag, mark_akey, mark_aval;
boolean done = false, ele_open = false;
int buffer_size = 0;
boolean taint = false;
Block block = null;
IRubyObject xmldecl, doctype, procins, stag, etag, emptytag, comment,
cdata, sym_text;
IRubyObject hpricot_scan(IRubyObject recv, IRubyObject port) {
attr = bufsize = runtime.getNil();
tag = new IRubyObject[]{runtime.getNil()};
akey = new IRubyObject[]{runtime.getNil()};
aval = new IRubyObject[]{runtime.getNil()};
RubyClass rb_eHpricotParseError = runtime.getModule("Hpricot").getClass("ParseError");
taint = port.isTaint();
if ( !port.respondsTo("read")) {
if ( port.respondsTo("to_str")) {
port = port.callMethod(runtime.getCurrentContext(),"to_str");
} else {
throw runtime.newArgumentError("bad Hpricot argument, String or IO only please.");
}
}
buffer_size = BUFSIZE;
if (recv.getInstanceVariable("@buffer_size") != null) {
bufsize = recv.getInstanceVariable("@buffer_size");
if (!bufsize.isNil()) {
buffer_size = RubyNumeric.fix2int(bufsize);
}
}
buf = new char[buffer_size];
%% write init;
while( !done ) {
IRubyObject str;
p = have;
int pe;
int len, space = buffer_size - have;
if ( space == 0 ) {
/* We've used up the entire buffer storing an already-parsed token
* prefix that must be preserved. Likely caused by super-long attributes.
* See ticket #13. */
rb_raise(rb_eHpricotParseError, "ran out of buffer space on element <" + tag[0].toString() + ">, starting on line "+curline+".");
}
if (port.respondsTo("read")) {
str = port.callMethod(runtime.getCurrentContext(),"read",runtime.newFixnum(space));
} else {
str = ((RubyString)port).substr(nread,space);
}
str = str.convertToString();
String sss = str.toString();
char[] chars = sss.toCharArray();
System.arraycopy(chars,0,buf,p,chars.length);
len = sss.length();
nread += len;
if ( len < space ) {
len++;
done = true;
}
pe = p + len;
char[] data = buf;
%% write exec;
if ( cs == hpricot_scan_error ) {
if(!tag[0].isNil()) {
rb_raise(rb_eHpricotParseError, "parse error on element <"+tag[0].toString()+">, starting on line "+curline+".\n" + NO_WAY_SERIOUSLY);
} else {
rb_raise(rb_eHpricotParseError, "parse error on line "+curline+".\n" + NO_WAY_SERIOUSLY);
}
}
if ( done && ele_open ) {
ele_open = false;
if(tokstart > -1) {
mark_tag = tokstart;
tokstart = -1;
text = true;
}
}
if(tokstart == -1) {
have = 0;
/* text nodes have no tokstart because each byte is parsed alone */
if(mark_tag != -1 && text) {
if (done) {
if(mark_tag < p-1) {
CAT(tag, p-1);
ELE(sym_text);
}
} else {
CAT(tag, p);
}
}
mark_tag = 0;
} else {
have = pe - tokstart;
System.arraycopy(buf,tokstart,buf,0,have);
SLIDE(tag);
SLIDE(akey);
SLIDE(aval);
tokend = (tokend - tokstart);
tokstart = 0;
}
}
return runtime.getNil();
}
public static IRubyObject __hpricot_scan(IRubyObject recv, IRubyObject port, Block block) {
Ruby runtime = recv.getRuntime();
HpricotScanService service = new HpricotScanService();
service.runtime = runtime;
service.xmldecl = runtime.newSymbol("xmldecl");
service.doctype = runtime.newSymbol("doctype");
service.procins = runtime.newSymbol("procins");
service.stag = runtime.newSymbol("stag");
service.etag = runtime.newSymbol("etag");
service.emptytag = runtime.newSymbol("emptytag");
service.comment = runtime.newSymbol("comment");
service.cdata = runtime.newSymbol("cdata");
service.sym_text = runtime.newSymbol("text");
service.block = block;
return service.hpricot_scan(recv, port);
}
public boolean basicLoad(final Ruby runtime) throws IOException {
Init_hpricot_scan(runtime);
return true;
}
public static void Init_hpricot_scan(Ruby runtime) {
RubyModule mHpricot = runtime.defineModule("Hpricot");
mHpricot.getMetaClass().attr_accessor(new IRubyObject[]{runtime.newSymbol("buffer_size")});
CallbackFactory fact = runtime.callbackFactory(HpricotScanService.class);
mHpricot.getMetaClass().defineMethod("scan",fact.getSingletonMethod("__hpricot_scan",IRubyObject.class));
mHpricot.defineClassUnder("ParseError",runtime.getClass("Exception"),runtime.getClass("Exception").getAllocator());
}
}

@ -0,0 +1,273 @@
/*
* hpricot_scan.rl
*
* $Author: why $
* $Date: 2006-05-08 22:03:50 -0600 (Mon, 08 May 2006) $
*
* Copyright (C) 2006 why the lucky stiff
*/
#include <ruby.h>
#define NO_WAY_SERIOUSLY "*** This should not happen, please send a bug report with the HTML you're parsing to why@whytheluckystiff.net. So sorry!"
static VALUE sym_xmldecl, sym_doctype, sym_procins, sym_stag, sym_etag, sym_emptytag, sym_comment,
sym_cdata, sym_text;
static VALUE rb_eHpricotParseError;
static ID s_read, s_to_str;
#define ELE(N) \
if (tokend > tokstart || text == 1) { \
VALUE raw_string = Qnil; \
ele_open = 0; text = 0; \
if (tokstart != 0 && sym_##N != sym_cdata && sym_##N != sym_text && sym_##N != sym_procins && sym_##N != sym_comment) { \
raw_string = rb_str_new(tokstart, tokend-tokstart); \
} \
rb_yield_tokens(sym_##N, tag, attr, raw_string, taint); \
}
#define SET(N, E) \
if (mark_##N == NULL || E == mark_##N) \
N = rb_str_new2(""); \
else if (E > mark_##N) \
N = rb_str_new(mark_##N, E - mark_##N);
#define CAT(N, E) if (NIL_P(N)) { SET(N, E); } else { rb_str_cat(N, mark_##N, E - mark_##N); }
#define SLIDE(N) if ( mark_##N > tokstart ) mark_##N = buf + (mark_##N - tokstart);
#define ATTR(K, V) \
if (!NIL_P(K)) { \
if (NIL_P(attr)) attr = rb_hash_new(); \
rb_hash_aset(attr, K, V); \
}
#define TEXT_PASS() \
if (text == 0) \
{ \
if (ele_open == 1) { \
ele_open = 0; \
if (tokstart > 0) { \
mark_tag = tokstart; \
} \
} else { \
mark_tag = p; \
} \
attr = Qnil; \
tag = Qnil; \
text = 1; \
}
#define EBLK(N, T) CAT(tag, p - T + 1); ELE(N);
%%{
machine hpricot_scan;
action newEle {
if (text == 1) {
CAT(tag, p);
ELE(text);
text = 0;
}
attr = Qnil;
tag = Qnil;
mark_tag = NULL;
ele_open = 1;
}
action _tag { mark_tag = p; }
action _aval { mark_aval = p; }
action _akey { mark_akey = p; }
action tag { SET(tag, p); }
action tagc { SET(tag, p-1); }
action aval { SET(aval, p); }
action aunq {
if (*(p-1) == '"' || *(p-1) == '\'') { SET(aval, p-1); }
else { SET(aval, p); }
}
action akey { SET(akey, p); }
action xmlver { SET(aval, p); ATTR(rb_str_new2("version"), aval); }
action xmlenc { SET(aval, p); ATTR(rb_str_new2("encoding"), aval); }
action xmlsd { SET(aval, p); ATTR(rb_str_new2("standalone"), aval); }
action pubid { SET(aval, p); ATTR(rb_str_new2("public_id"), aval); }
action sysid { SET(aval, p); ATTR(rb_str_new2("system_id"), aval); }
action new_attr {
akey = Qnil;
aval = Qnil;
mark_akey = NULL;
mark_aval = NULL;
}
action save_attr {
ATTR(akey, aval);
}
include hpricot_common "ext/hpricot_scan/hpricot_common.rl";
}%%
%% write data nofinal;
#define BUFSIZE 16384
void rb_yield_tokens(VALUE sym, VALUE tag, VALUE attr, VALUE raw, int taint)
{
VALUE ary;
if (sym == sym_text) {
raw = tag;
}
ary = rb_ary_new3(4, sym, tag, attr, raw);
if (taint) {
OBJ_TAINT(ary);
OBJ_TAINT(tag);
OBJ_TAINT(attr);
OBJ_TAINT(raw);
}
rb_yield(ary);
}
VALUE hpricot_scan(VALUE self, VALUE port)
{
int cs, act, have = 0, nread = 0, curline = 1, text = 0;
char *tokstart = 0, *tokend = 0, *buf = NULL;
VALUE attr = Qnil, tag = Qnil, akey = Qnil, aval = Qnil, bufsize = Qnil;
char *mark_tag = 0, *mark_akey = 0, *mark_aval = 0;
int done = 0, ele_open = 0, buffer_size = 0;
int taint = OBJ_TAINTED( port );
if ( !rb_respond_to( port, s_read ) )
{
if ( rb_respond_to( port, s_to_str ) )
{
port = rb_funcall( port, s_to_str, 0 );
StringValue(port);
}
else
{
rb_raise( rb_eArgError, "bad Hpricot argument, String or IO only please." );
}
}
buffer_size = BUFSIZE;
if (rb_ivar_defined(self, rb_intern("@buffer_size")) == Qtrue) {
bufsize = rb_ivar_get(self, rb_intern("@buffer_size"));
if (!NIL_P(bufsize)) {
buffer_size = NUM2INT(bufsize);
}
}
buf = ALLOC_N(char, buffer_size);
%% write init;
while ( !done ) {
VALUE str;
char *p = buf + have, *pe;
int len, space = buffer_size - have;
if ( space == 0 ) {
/* We've used up the entire buffer storing an already-parsed token
* prefix that must be preserved. Likely caused by super-long attributes.
* See ticket #13. */
free(buf); /* release the scan buffer before raising, as the parse-error path does */
rb_raise(rb_eHpricotParseError, "ran out of buffer space on element <%s>, starting on line %d.", RSTRING(tag)->ptr, curline);
}
if ( rb_respond_to( port, s_read ) )
{
str = rb_funcall( port, s_read, 1, INT2FIX(space) );
}
else
{
str = rb_str_substr( port, nread, space );
}
StringValue(str);
memcpy( p, RSTRING(str)->ptr, RSTRING(str)->len );
len = RSTRING(str)->len;
nread += len;
/* If this is the last buffer, tack on an EOF. */
if ( len < space ) {
p[len++] = 0;
done = 1;
}
pe = p + len;
%% write exec;
if ( cs == hpricot_scan_error ) {
free(buf);
if ( !NIL_P(tag) )
{
rb_raise(rb_eHpricotParseError, "parse error on element <%s>, starting on line %d.\n" NO_WAY_SERIOUSLY, RSTRING(tag)->ptr, curline);
}
else
{
rb_raise(rb_eHpricotParseError, "parse error on line %d.\n" NO_WAY_SERIOUSLY, curline);
}
}
if ( done && ele_open )
{
ele_open = 0;
if (tokstart > 0) {
mark_tag = tokstart;
tokstart = 0;
text = 1;
}
}
if ( tokstart == 0 )
{
have = 0;
/* text nodes have no tokstart because each byte is parsed alone */
if ( mark_tag != NULL && text == 1 )
{
if (done)
{
if (mark_tag < p-1)
{
CAT(tag, p-1);
ELE(text);
}
}
else
{
CAT(tag, p);
}
}
mark_tag = buf;
}
else
{
have = pe - tokstart;
memmove( buf, tokstart, have );
SLIDE(tag);
SLIDE(akey);
SLIDE(aval);
tokend = buf + (tokend - tokstart);
tokstart = buf;
}
}
free(buf);
return Qnil;
}
void Init_hpricot_scan()
{
VALUE mHpricot = rb_define_module("Hpricot");
rb_define_attr(rb_singleton_class(mHpricot), "buffer_size", 1, 1);
rb_define_singleton_method(mHpricot, "scan", hpricot_scan, 1);
rb_eHpricotParseError = rb_define_class_under(mHpricot, "ParseError", rb_eException);
s_read = rb_intern("read");
s_to_str = rb_intern("to_str");
sym_xmldecl = ID2SYM(rb_intern("xmldecl"));
sym_doctype = ID2SYM(rb_intern("doctype"));
sym_procins = ID2SYM(rb_intern("procins"));
sym_stag = ID2SYM(rb_intern("stag"));
sym_etag = ID2SYM(rb_intern("etag"));
sym_emptytag = ID2SYM(rb_intern("emptytag"));
sym_comment = ID2SYM(rb_intern("comment"));
sym_cdata = ID2SYM(rb_intern("cdata"));
sym_text = ID2SYM(rb_intern("text"));
}
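
The scanner above reads its input in fixed-size chunks and, when a token is still open at the end of a chunk, slides the unfinished part to the front of the buffer before reading more; if the buffer is already full of unfinished token, it raises. The following is a minimal Ruby sketch of that idea only, not Hpricot's implementation: it splits on whitespace instead of running a Ragel machine, and the names `each_word` and `BufferFullError` are hypothetical.

```ruby
require 'stringio'

# Hypothetical error for this sketch; Hpricot itself raises Hpricot::ParseError.
class BufferFullError < StandardError; end

# Yield each whitespace-separated word from +io+ while keeping at most
# +buffer_size+ characters in memory at any time.
def each_word(io, buffer_size = 16)
  buf = String.new
  done = false
  until done
    space = buffer_size - buf.length
    # The whole buffer is taken up by one unfinished token.
    raise BufferFullError, "token longer than #{buffer_size} characters" if space == 0

    chunk = io.read(space)
    # A short (or empty) read is treated as end of input, as in the scanner.
    done = true if chunk.nil? || chunk.length < space
    buf << chunk.to_s

    if done
      complete, buf = buf, String.new
    elsif (cut = buf.rindex(/\s/))
      complete = buf[0..cut]
      buf = buf[(cut + 1)..-1] # slide the unfinished tail to the front
    else
      complete = ""            # nothing finished yet, keep reading
    end
    complete.split.each { |word| yield word }
  end
end

each_word(StringIO.new("alpha beta gamma delta"), 8) { |word| puts word }
```

Raising `buffer_size` in this sketch plays the same role as `Hpricot.buffer_size` in the real scanner: it lets longer unfinished tokens fit before the error is raised.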

@ -0,0 +1,176 @@
# This rbconfig.rb corresponds to a Ruby installation for win32 cross-compiled
# with mingw under i686-linux. It can be used to cross-compile extensions for
# win32 using said toolchain.
#
# This file assumes that a cross-compiled mingw32 build (compatible with the
# mswin32 builds) is installed under $HOME/ruby-mingw32.
module Config
#RUBY_VERSION == "1.8.5" or
# raise "ruby lib version (1.8.5) doesn't match executable version (#{RUBY_VERSION})"
mingw32 = ENV['MINGW32_RUBY'] || "#{ENV["HOME"]}/ruby-mingw32"
mingwpre = ENV['MINGW32_PREFIX']
TOPDIR = File.dirname(__FILE__).chomp!("/lib/ruby/1.8/i386-mingw32")
DESTDIR = '' unless defined? DESTDIR
CONFIG = {}
CONFIG["DESTDIR"] = DESTDIR
CONFIG["INSTALL"] = "/usr/bin/install -c"
CONFIG["prefix"] = (TOPDIR || DESTDIR + mingw32)
CONFIG["EXEEXT"] = ".exe"
CONFIG["ruby_install_name"] = "ruby"
CONFIG["RUBY_INSTALL_NAME"] = "ruby"
CONFIG["RUBY_SO_NAME"] = "msvcrt-ruby18"
CONFIG["SHELL"] = "/bin/sh"
CONFIG["PATH_SEPARATOR"] = ":"
CONFIG["PACKAGE_NAME"] = ""
CONFIG["PACKAGE_TARNAME"] = ""
CONFIG["PACKAGE_VERSION"] = ""
CONFIG["PACKAGE_STRING"] = ""
CONFIG["PACKAGE_BUGREPORT"] = ""
CONFIG["exec_prefix"] = "$(prefix)"
CONFIG["bindir"] = "$(exec_prefix)/bin"
CONFIG["sbindir"] = "$(exec_prefix)/sbin"
CONFIG["libexecdir"] = "$(exec_prefix)/libexec"
CONFIG["datadir"] = "$(prefix)/share"
CONFIG["sysconfdir"] = "$(prefix)/etc"
CONFIG["sharedstatedir"] = "$(prefix)/com"
CONFIG["localstatedir"] = "$(prefix)/var"
CONFIG["libdir"] = "$(exec_prefix)/lib"
CONFIG["includedir"] = "$(prefix)/include"
CONFIG["oldincludedir"] = "/usr/include"
CONFIG["infodir"] = "$(prefix)/info"
CONFIG["mandir"] = "$(prefix)/man"
CONFIG["build_alias"] = "i686-linux"
CONFIG["host_alias"] = "#{mingwpre}"
CONFIG["target_alias"] = "i386-mingw32"
CONFIG["ECHO_C"] = ""
CONFIG["ECHO_N"] = "-n"
CONFIG["ECHO_T"] = ""
CONFIG["LIBS"] = "-lwsock32 "
CONFIG["MAJOR"] = "1"
CONFIG["MINOR"] = "8"
CONFIG["TEENY"] = "4"
CONFIG["build"] = "i686-pc-linux"
CONFIG["build_cpu"] = "i686"
CONFIG["build_vendor"] = "pc"
CONFIG["build_os"] = "linux"
CONFIG["host"] = "i586-pc-mingw32msvc"
CONFIG["host_cpu"] = "i586"
CONFIG["host_vendor"] = "pc"
CONFIG["host_os"] = "mingw32msvc"
CONFIG["target"] = "i386-pc-mingw32"
CONFIG["target_cpu"] = "i386"
CONFIG["target_vendor"] = "pc"
CONFIG["target_os"] = "mingw32"
CONFIG["CC"] = "#{mingwpre}-gcc"
CONFIG["CFLAGS"] = "-g -O2 "
CONFIG["LDFLAGS"] = ""
CONFIG["CPPFLAGS"] = ""
CONFIG["OBJEXT"] = "o"
CONFIG["CPP"] = "#{mingwpre}-gcc -E"
CONFIG["EGREP"] = "grep -E"
CONFIG["GNU_LD"] = "yes"
CONFIG["CPPOUTFILE"] = "-o conftest.i"
CONFIG["OUTFLAG"] = "-o "
CONFIG["YACC"] = "bison -y"
CONFIG["RANLIB"] = "#{mingwpre}-ranlib"
CONFIG["AR"] = "#{mingwpre}-ar"
CONFIG["NM"] = "#{mingwpre}-nm"
CONFIG["WINDRES"] = "#{mingwpre}-windres"
CONFIG["DLLWRAP"] = "#{mingwpre}-dllwrap"
CONFIG["OBJDUMP"] = "#{mingwpre}-objdump"
CONFIG["LN_S"] = "ln -s"
CONFIG["SET_MAKE"] = ""
CONFIG["INSTALL_PROGRAM"] = "$(INSTALL)"
CONFIG["INSTALL_SCRIPT"] = "$(INSTALL)"
CONFIG["INSTALL_DATA"] = "$(INSTALL) -m 644"
CONFIG["RM"] = "rm -f"
CONFIG["CP"] = "cp"
CONFIG["MAKEDIRS"] = "mkdir -p"
CONFIG["LIBOBJS"] = " fileblocks$(U).o crypt$(U).o flock$(U).o acosh$(U).o win32$(U).o"
CONFIG["ALLOCA"] = ""
CONFIG["DLDFLAGS"] = " -Wl,--enable-auto-import,--export-all"
CONFIG["ARCH_FLAG"] = ""
CONFIG["STATIC"] = ""
CONFIG["CCDLFLAGS"] = ""
CONFIG["LDSHARED"] = "#{mingwpre}-gcc -shared -s"
CONFIG["DLEXT"] = "so"
CONFIG["DLEXT2"] = "dll"
CONFIG["LIBEXT"] = "a"
CONFIG["LINK_SO"] = ""
CONFIG["LIBPATHFLAG"] = " -L\"%s\""
CONFIG["RPATHFLAG"] = ""
CONFIG["LIBPATHENV"] = ""
CONFIG["TRY_LINK"] = ""
CONFIG["STRIP"] = "strip"
CONFIG["EXTSTATIC"] = ""
CONFIG["setup"] = "Setup"
CONFIG["MINIRUBY"] = "ruby -rfake"
CONFIG["PREP"] = "fake.rb"
CONFIG["RUNRUBY"] = "$(MINIRUBY) -I`cd $(srcdir)/lib; pwd`"
CONFIG["EXTOUT"] = ".ext"
CONFIG["ARCHFILE"] = ""
CONFIG["RDOCTARGET"] = ""
CONFIG["XCFLAGS"] = " -DRUBY_EXPORT"
CONFIG["XLDFLAGS"] = " -Wl,--stack,0x02000000 -L."
CONFIG["LIBRUBY_LDSHARED"] = "#{mingwpre}-gcc -shared -s"
CONFIG["LIBRUBY_DLDFLAGS"] = " -Wl,--enable-auto-import,--export-all -Wl,--out-implib=$(LIBRUBY)"
CONFIG["rubyw_install_name"] = "rubyw"
CONFIG["RUBYW_INSTALL_NAME"] = "rubyw"
CONFIG["LIBRUBY_A"] = "lib$(RUBY_SO_NAME)-static.a"
CONFIG["LIBRUBY_SO"] = "$(RUBY_SO_NAME).dll"
CONFIG["LIBRUBY_ALIASES"] = ""
CONFIG["LIBRUBY"] = "lib$(LIBRUBY_SO).a"
CONFIG["LIBRUBYARG"] = "$(LIBRUBYARG_SHARED)"
CONFIG["LIBRUBYARG_STATIC"] = "-l$(RUBY_SO_NAME)-static"
CONFIG["LIBRUBYARG_SHARED"] = "-l$(RUBY_SO_NAME)"
CONFIG["SOLIBS"] = "$(LIBS)"
CONFIG["DLDLIBS"] = ""
CONFIG["ENABLE_SHARED"] = "yes"
CONFIG["MAINLIBS"] = ""
CONFIG["COMMON_LIBS"] = "m"
CONFIG["COMMON_MACROS"] = ""
CONFIG["COMMON_HEADERS"] = "windows.h winsock.h"
CONFIG["EXPORT_PREFIX"] = ""
CONFIG["MINIOBJS"] = "dmydln.o"
CONFIG["MAKEFILES"] = "Makefile GNUmakefile"
CONFIG["arch"] = "i386-mingw32"
CONFIG["sitearch"] = "i386-msvcrt"
CONFIG["sitedir"] = "$(prefix)/lib/ruby/site_ruby"
CONFIG["configure_args"] = "'--host=#{mingwpre}' '--target=i386-mingw32' '--build=i686-linux' '--prefix=#{mingw32}' 'build_alias=i686-linux' 'host_alias=#{mingwpre}' 'target_alias=i386-mingw32'"
CONFIG["NROFF"] = "/usr/bin/nroff"
CONFIG["MANTYPE"] = "doc"
CONFIG["LTLIBOBJS"] = " fileblocks$(U).lo crypt$(U).lo flock$(U).lo acosh$(U).lo win32$(U).lo"
CONFIG["ruby_version"] = "$(MAJOR).$(MINOR)"
CONFIG["rubylibdir"] = "$(libdir)/ruby/$(ruby_version)"
CONFIG["archdir"] = "$(rubylibdir)/$(arch)"
CONFIG["sitelibdir"] = "$(sitedir)/$(ruby_version)"
CONFIG["sitearchdir"] = "$(sitelibdir)/$(sitearch)"
CONFIG["topdir"] = File.dirname(__FILE__)
MAKEFILE_CONFIG = {}
CONFIG.each{|k,v| MAKEFILE_CONFIG[k] = v.dup}
def Config::expand(val, config = CONFIG)
val.gsub!(/\$\$|\$\(([^()]+)\)|\$\{([^{}]+)\}/) do |var|
if !(v = $1 || $2)
'$'
elsif key = config[v = v[/\A[^:]+(?=(?::(.*?)=(.*))?\z)/]]
pat, sub = $1, $2
config[v] = false
Config::expand(key, config)
config[v] = key
key = key.gsub(/#{Regexp.quote(pat)}(?=\s|\z)/n) {sub} if pat
key
else
var
end
end
val
end
CONFIG.each_value do |val|
Config::expand(val)
end
end
RbConfig = Config # compatibility for ruby-1.9
CROSS_COMPILING = nil unless defined? CROSS_COMPILING
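
`Config::expand` above resolves Makefile-style references such as `$(prefix)` inside the CONFIG values. The sketch below shows the core of that substitution in a simplified, non-mutating form. It is an illustration under assumptions: `expand_vars` is a hypothetical helper, it handles only `$$` and `$(name)`, and it leaves out the `${name}` and `$(name:pat=sub)` forms the original also accepts. Where the original guards against self-reference by temporarily setting the entry to `false`, this sketch tracks the names already being expanded.

```ruby
# Expand $(name) references in +value+ using +config+.
# "$$" collapses to a literal "$"; unknown or self-referencing names
# are left untouched.
def expand_vars(value, config, seen = [])
  value.gsub(/\$\$|\$\(([^()]+)\)/) do |match|
    name = $1 # capture before recursing, since gsub resets $1
    if name.nil?
      "$"
    elsif config.key?(name) && !seen.include?(name)
      expand_vars(config[name], config, seen + [name])
    else
      match
    end
  end
end

config = {
  "prefix"      => "/opt/ruby-mingw32",
  "exec_prefix" => "$(prefix)",
  "bindir"      => "$(exec_prefix)/bin"
}
puts expand_vars(config["bindir"], config)
```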

@ -0,0 +1,3 @@
require File.join(File.dirname(__FILE__), 'lib', 'hpricot')

@ -0,0 +1,26 @@
# == About hpricot.rb
#
# All of Hpricot's various parts are loaded when you use <tt>require 'hpricot'</tt>.
#
# * hpricot_scan: the scanner (a C extension for Ruby) which turns an HTML stream into tokens.
# * hpricot/parse.rb: uses the scanner to sort through tokens and give you back a complete document object.
# * hpricot/tag.rb: sets up objects for the various types of elements in an HTML document.
# * hpricot/modules.rb: categorizes the various elements using mixins.
# * hpricot/traverse.rb: methods for searching documents.
# * hpricot/elements.rb: methods for dealing with a group of elements as an Hpricot::Elements list.
# * hpricot/inspect.rb: methods for displaying documents in a readable form.
# If available, Nikolai's UTF-8 library will ease use of utf-8 documents.
# See http://git.bitwi.se/ruby-character-encodings.git/.
begin
require 'encoding/character/utf-8'
rescue LoadError
end
require 'hpricot_scan'
require 'hpricot/tag'
require 'hpricot/modules'
require 'hpricot/traverse'
require 'hpricot/inspect'
require 'hpricot/parse'
require 'hpricot/builder'

@ -0,0 +1,63 @@
#!/usr/bin/env ruby
#--
# Copyright 2004 by Jim Weirich (jim@weirichhouse.org).
# All rights reserved.
# Permission is granted for use, copying, modification, distribution,
# and distribution of modified versions of this work as long as the
# above copyright notice is included.
#++
module Hpricot
# BlankSlate provides an abstract base class with no predefined
# methods (except for <tt>\_\_send__</tt> and <tt>\_\_id__</tt>).
# BlankSlate is useful as a base class when writing classes that
# depend upon <tt>method_missing</tt> (e.g. dynamic proxies).
class BlankSlate
class << self
# Hide the method named +name+ in the BlankSlate class. Don't
# hide +instance_eval+ or any method beginning with "__".
def hide(name)
undef_method name if
instance_methods.include?(name.to_s) and
name !~ /^(__|instance_eval)/
end
end
instance_methods.each { |m| hide(m) }
end
end
# Since Ruby is very dynamic, methods added to the ancestors of
# BlankSlate <em>after BlankSlate is defined</em> will show up in the
# list of available BlankSlate methods. We handle this by defining a
# hook in the Object class and the Kernel module that hides any such
# newly defined methods in BlankSlate.
module Kernel
class << self
alias_method :hpricot_slate_method_added, :method_added
# Detect method additions to Kernel and remove them in the
# BlankSlate class.
def method_added(name)
hpricot_slate_method_added(name)
return if self != Kernel
Hpricot::BlankSlate.hide(name)
end
end
end
class Object
class << self
alias_method :hpricot_slate_method_added, :method_added
# Detect method additions to Object and remove them in the
# BlankSlate class.
def method_added(name)
hpricot_slate_method_added(name)
return if self != Object
Hpricot::BlankSlate.hide(name)
end
end
end
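
The point of a blank slate is that almost every call on an instance falls through to <tt>method_missing</tt>. The sketch below illustrates the technique in isolation and makes some assumptions: `TinySlate` and `CallRecorder` are hypothetical names, the pattern also keeps `object_id` to avoid interpreter warnings, and it does not install the Kernel/Object hooks shown above. On Ruby 1.9 and later, `BasicObject` offers a similar starting point out of the box.

```ruby
# A class with nearly all inherited instance methods removed.
class TinySlate
  instance_methods.each do |m|
    # Keep the "__" methods, instance_eval and object_id; hide the rest.
    undef_method(m) unless m.to_s =~ /\A(__|instance_eval|object_id)/
  end
end

# Records the name of every method called on it instead of answering.
class CallRecorder < TinySlate
  def initialize
    @calls = []
  end

  # Even names such as +class+ or +inspect+ arrive here, because
  # TinySlate no longer defines them.
  def method_missing(name, *args)
    @calls << name
    self
  end

  # Accessor named with "__" so it reads as part of the slate's own API.
  def __calls__
    @calls
  end
end

recorder = CallRecorder.new
recorder.class.inspect.title
p recorder.__calls__
```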

@ -0,0 +1,200 @@
require 'hpricot/tags'
require 'hpricot/xchar'
require 'hpricot/blankslate'
module Hpricot
def self.build(ele = Doc.new, assigns = {}, &blk)
ele.extend Builder
assigns.each do |k, v|
ele.instance_variable_set("@#{k}", v)
end
ele.instance_eval &blk
ele
end
module Builder
@@default = {
:indent => 0,
:output_helpers => true,
:output_xml_instruction => true,
:output_meta_tag => true,
:auto_validation => true,
:tagset => Hpricot::XHTMLTransitional,
:root_attributes => {
:xmlns => 'http://www.w3.org/1999/xhtml', :'xml:lang' => 'en', :lang => 'en'
}
}
def self.set(option, value)
@@default[option] = value
end
# Write a +string+ to the HTML stream, making sure to escape it.
def text!(string)
@children << Text.new(Hpricot.xs(string))
end
# Write a +string+ to the HTML stream without escaping it.
def text(string)
@children << Text.new(string)
nil
end
alias_method :<<, :text
alias_method :concat, :text
# Create a tag named +tag+. Other than the first argument which is the tag name,
# the arguments are the same as the tags implemented via method_missing.
def tag!(tag, *args, &block)
ele_id = nil
if @auto_validation and @tagset
if !@tagset.tagset.has_key?(tag)
raise InvalidXhtmlError, "no element `#{tag}' for #{@tagset.doctype}"
elsif args.last.respond_to?(:to_hash)
attrs = args.last.to_hash
if @tagset.forms.include?(tag) and attrs[:id]
attrs[:name] ||= attrs[:id]
end
attrs.each do |k, v|
atname = k.to_s.downcase.intern
unless k =~ /:/ or @tagset.tagset[tag].include? atname
raise InvalidXhtmlError, "no attribute `#{k}' on #{tag} elements"
end
if atname == :id
ele_id = v.to_s
if @elements.has_key? ele_id
raise InvalidXhtmlError, "id `#{ele_id}' already used (id's must be unique)."
end
end
end
end
end
# turn arguments into children or attributes
childs = []
attrs = args.grep(Hash)
childs.concat((args - attrs).map do |x|
if x.respond_to? :to_html
Hpricot.make(x.to_html)
elsif x
Text.new(Hpricot.xs(x))
end
end.flatten)
attrs = attrs.inject({}) do |hsh, ath|
ath.each do |k, v|
hsh[k] = Hpricot.xs(v.to_s) if v
end
hsh
end
# create the element itself
f = Elem.new(STag.new(tag, attrs), childs, ETag.new(tag))
# build children from the block
if block
build(f, &block)
end
@children << f
f
end
def build(*a, &b)
Hpricot.build(*a, &b)
end
# Every HTML tag method goes through an html_tag call. So, calling <tt>div</tt> is equivalent
# to calling <tt>html_tag(:div)</tt>. All HTML tags in Hpricot's list are given generated wrappers
# for this method.
#
# If the @auto_validation setting is on, this method will check for many common mistakes which
# could lead to invalid XHTML.
def html_tag(sym, *args, &block)
if @auto_validation and @tagset.self_closing.include?(sym) and block
raise InvalidXhtmlError, "the `#{sym}' element is self-closing, please remove the block"
elsif args.empty? and block.nil?
CssProxy.new(self, sym)
else
tag!(sym, *args, &block)
end
end
XHTMLTransitional.tags.each do |k|
class_eval %{
def #{k}(*args, &block)
html_tag(#{k.inspect}, *args, &block)
end
}
end
def doctype(target, pub, sys)
@children << DocType.new(target, pub, sys)
end
remove_method :head
# Builds a head tag. Adds a <tt>meta</tt> tag inside with Content-Type
# set to <tt>text/html; charset=utf-8</tt>.
def head(*args, &block)
tag!(:head, *args) do
tag!(:meta, "http-equiv" => "Content-Type", "content" => "text/html; charset=utf-8") if @output_meta_tag
instance_eval(&block)
end
end
# Builds an html tag. An XML 1.0 instruction and an XHTML 1.0 Transitional doctype
# are prepended. Also assumes <tt>:xmlns => "http://www.w3.org/1999/xhtml",
# :lang => "en"</tt>.
def xhtml_transitional(attrs = {}, &block)
# self.tagset = Hpricot::XHTMLTransitional
xhtml_html(attrs, &block)
end
# Builds an html tag with XHTML 1.0 Strict doctype instead.
def xhtml_strict(attrs = {}, &block)
# self.tagset = Hpricot::XHTMLStrict
xhtml_html(attrs, &block)
end
private
def xhtml_html(attrs = {}, &block)
instruct! if @output_xml_instruction
doctype(:html, *@@default[:tagset].doctype)
tag!(:html, @@default[:root_attributes].merge(attrs), &block)
end
end
# Class used by Hpricot::Builder (inherited from Markaby) to store element
# options. Methods called against the CssProxy object are added as element
# classes or IDs.
#
# See the README for examples.
class CssProxy < BlankSlate
# Creates a CssProxy object.
def initialize(builder, sym)
@builder, @sym, @attrs = builder, sym, {}
end
# Adds attributes to an element. Bang methods set the :id attribute.
# Other methods add to the :class attribute.
def method_missing(id_or_class, *args, &block)
if (idc = id_or_class.to_s) =~ /!$/
@attrs[:id] = $`
else
@attrs[:class] = @attrs[:class].nil? ? idc : "#{@attrs[:class]} #{idc}".strip
end
if block or args.any?
args.push(@attrs)
return @builder.tag!(@sym, *args, &block)
end
return self
end
end
end
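
CssProxy turns chained method names into attributes: a name ending in <tt>!</tt> sets the id, and any other name is appended to the class list. The sketch below reproduces just that bookkeeping so it can be run on its own. It is an assumption-laden illustration: `AttrChain` is a hypothetical class, it does not inherit from BlankSlate, and it never hands the collected attributes to a builder's <tt>tag!</tt> as the real proxy does when given arguments or a block.

```ruby
# Collects CSS classes and an id from chained method calls.
class AttrChain
  attr_reader :attrs

  def initialize
    @attrs = {}
  end

  # Bang methods set :id; every other method name is added to :class.
  def method_missing(name, *args)
    idc = name.to_s
    if idc.end_with?("!")
      @attrs[:id] = idc.chomp("!")
    else
      @attrs[:class] = [@attrs[:class], idc].compact.join(" ")
    end
    self
  end

  # Advertise that any method name is accepted.
  def respond_to_missing?(name, include_private = false)
    true
  end
end

chain = AttrChain.new.entry.featured.post_12!
p chain.attrs
```

Because the sketch does not start from a blank slate, names that Object already defines (such as <tt>class</tt> or <tt>display</tt>) would not reach <tt>method_missing</tt>; that is the gap BlankSlate closes for the real proxy.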

@ -0,0 +1,510 @@
module Hpricot
# Once you've matched a list of elements, you will often need to handle them as
# a group. Or you may want to perform the same action on each of them.
# Hpricot::Elements is an extension of Ruby's Array class, with some methods
# added for altering elements contained in the array.
#
# If you need to create an element array from regular elements:
#
# Hpricot::Elements[ele1, ele2, ele3]
#
# Assuming that ele1, ele2 and ele3 contain element objects (Hpricot::Elem,
# Hpricot::Doc, etc.)
#
# == Continuing Searches
#
# Usually the Hpricot::Elements you're working on comes from a search you've
# done. Well, you can continue searching the list by using the same <tt>at</tt>
# and <tt>search</tt> methods you can use on plain elements.
#
# elements = doc.search("/div/p")
# elements = elements.search("/a[@href='http://hoodwink.d/']")
# elements = elements.at("img")
#
# == Altering Elements
#
# When you're altering elements in the list, your changes will be reflected in
# the document you started searching from.
#
# doc = Hpricot("That's my <b>spoon</b>, Tyler.")
# doc.at("b").swap("<i>fork</i>")
# doc.to_html
# #=> "That's my <i>fork</i>, Tyler."
#
# == Getting More Detailed
#
# If you can't find a method here that does what you need, you may need to
# loop through the elements and find a method in Hpricot::Container::Trav
# which can do what you need.
#
# For example, you may want to search for all the H3 header tags in a document
# and grab all the tags underneath the header, but not inside the header.
# A good method for this is <tt>next_sibling</tt>:
#
# ary = []
# doc.search("h3").each do |h3|
# ele = h3
# while ele = ele.next_sibling
# ary << ele # stuff away all the elements under the h3
# end
# end
#
# Most of the useful element methods are in the mixins Hpricot::Traverse
# and Hpricot::Container::Trav.
class Elements < Array
# Searches this list for any elements (or children of these elements) matching
# the CSS or XPath expression +expr+. Root is assumed to be the element scanned.
#
# See Hpricot::Container::Trav.search for more.
def search(*expr,&blk)
Elements[*map { |x| x.search(*expr,&blk) }.flatten.uniq]
end
alias_method :/, :search
# Searches this list for the first element (or child of these elements) matching
# the CSS or XPath expression +expr+. Root is assumed to be the element scanned.
#
# See Hpricot::Container::Trav.at for more.
def at(expr, &blk)
search(expr, &blk).first
end
alias_method :%, :at
# Convert this group of elements into a complete HTML fragment, returned as a
# string.
def to_html
map { |x| x.output("") }.join
end
alias_method :to_s, :to_html
# Returns an HTML fragment built of the contents of each element in this list.
#
# If an HTML +string+ is supplied, this method acts like inner_html=.
def inner_html(*string)
if string.empty?
map { |x| x.inner_html }.join
else
x = self.inner_html = string.pop || x
end
end
alias_method :html, :inner_html
alias_method :innerHTML, :inner_html
# Replaces the contents of each element in this list. Supply an HTML +string+,
# which is loaded into Hpricot objects and inserted into every element in this
# list.
def inner_html=(string)
each { |x| x.inner_html = string }
end
alias_method :html=, :inner_html=
alias_method :innerHTML=, :inner_html=
# Returns a string containing the text contents of each element in this list.
# All HTML tags are removed.
def inner_text
map { |x| x.inner_text }.join
end
alias_method :text, :inner_text
# Remove all elements in this list from the document which contains them.
#
# doc = Hpricot("<html>Remove this: <b>here</b></html>")
# doc.search("b").remove
# doc.to_html
# => "<html>Remove this: </html>"
#
def remove
each { |x| x.parent.children.delete(x) }
end
# Empty the elements in this list, by removing their insides.
#
# doc = Hpricot("<p> We have <i>so much</i> to say.</p>")
# doc.search("i").empty
# doc.to_html
# => "<p> We have <i></i> to say.</p>"
#
def empty
each { |x| x.inner_html = nil }
end
# Add to the end of the contents inside each element in this list.
# Pass in an HTML +str+, which is turned into Hpricot elements.
def append(str = nil, &blk)
each { |x| x.html(x.children + Hpricot.make(str, &blk)) }
end
# Add to the start of the contents inside each element in this list.
# Pass in an HTML +str+, which is turned into Hpricot elements.
def prepend(str = nil, &blk)
each { |x| x.html(Hpricot.make(str, &blk) + x.children) }
end
# Add some HTML just before each element in this list.
# Pass in an HTML +str+, which is turned into Hpricot elements.
def before(str = nil, &blk)
each { |x| x.parent.insert_before Hpricot.make(str, &blk), x }
end
# Just after each element in this list, add some HTML.
# Pass in an HTML +str+, which is turned into Hpricot elements.
def after(str = nil, &blk)
each { |x| x.parent.insert_after Hpricot.make(str, &blk), x }
end
# Wraps each element in the list inside the element created by HTML +str+.
# If more than one element is found in the string, Hpricot locates the
# deepest spot inside the first element.
#
# doc.search("a[@href]").
# wrap(%{<div class="link"><div class="link_inner"></div></div>})
#
# This code wraps every link on the page inside a +div.link+ and a +div.link_inner+ nest.
def wrap(str = nil, &blk)
each do |x|
wrap = Hpricot.make(str, &blk)
nest = wrap.detect { |w| w.respond_to? :children }
unless nest
raise Exception, "No wrapping element found."
end
x.parent.replace_child(x, wrap)
nest = nest.children.first until nest.empty?
nest.html(nest.children + [x])
end
end
# Gets and sets attributes on all matched elements.
#
# Pass in a +key+ on its own and this method will return the string value
# assigned to that attribute for the first element, or +nil+ if the
# attribute isn't found.
#
# doc.search("a").attr("href")
# #=> "http://hacketyhack.net/"
#
# Or, pass in a +key+ and +value+. This will set an attribute for all
# matched elements.
#
# doc.search("p").attr("class", "basic")
#
# You may also use a Hash to set a series of attributes:
#
# (doc/"a").attr(:class => "basic", :href => "http://hackety.org/")
#
# Lastly, a block can be used to rewrite an attribute based on the element
# it belongs to. The block will pass in an element. Return from the block
# the new value of the attribute.
#
# records.attr("href") { |e| e['href'] + "#top" }
#
# This example adds a <tt>#top</tt> anchor to each link.
#
def attr key, value = nil, &blk
if value or blk
each do |el|
el.set_attribute(key, value || blk[el])
end
return self
end
if key.is_a? Hash
key.each { |k,v| self.attr(k,v) }
return self
else
return self[0].get_attribute(key)
end
end
alias_method :set, :attr
# Adds the class to all matched elements.
#
# (doc/"p").add_class("bacon")
#
# Now all paragraphs will have class="bacon".
def add_class class_name
each do |el|
next unless el.respond_to? :get_attribute
classes = el.get_attribute('class').to_s.split(" ")
el.set_attribute('class', classes.push(class_name).uniq.join(" "))
end
self
end
# Remove an attribute from each of the matched elements.
#
# (doc/"input").remove_attr("disabled")
#
def remove_attr name
each do |el|
next unless el.respond_to? :remove_attribute
el.remove_attribute(name)
end
self
end
# Removes a class from all matched elements.
#
# (doc/"span").remove_class("lightgrey")
#
# Or, to remove all classes:
#
# (doc/"span").remove_class
#
def remove_class name = nil
each do |el|
next unless el.respond_to? :get_attribute
if name
classes = el.get_attribute('class').to_s.split(" ")
el.set_attribute('class', (classes - [name]).uniq.join(" "))
else
el.remove_attribute("class")
end
end
self
end
ATTR_RE = %r!\[ *(?:(@)([\w\(\)-]+)|([\w\(\)-]+\(\))) *([~\!\|\*$\^=]*) *'?"?([^\]'"]*)'?"? *\]!i
BRACK_RE = %r!(\[) *([^\]]*) *\]+!i
FUNC_RE = %r!(:)?([a-zA-Z0-9\*_-]*)\( *[\"']?([^ \)]*?)['\"]? *\)!
CUST_RE = %r!(:)([a-zA-Z0-9\*_-]*)()!
CATCH_RE = %r!([:\.#]*)([a-zA-Z0-9\*_-]+)!
def self.filter(nodes, expr, truth = true)
until expr.empty?
_, *m = *expr.match(/^(?:#{ATTR_RE}|#{BRACK_RE}|#{FUNC_RE}|#{CUST_RE}|#{CATCH_RE})/)
break unless _
expr = $'
m.compact!
if m[0] == '@'
m[0] = "@#{m.slice!(2,1)}"
end
if m[0] == '[' && m[1] =~ /^\d+$/
m = [":", "nth", m[1].to_i-1]
end
if m[0] == ":" && m[1] == "not"
nodes, = Elements.filter(nodes, m[2], false)
elsif "#{m[0]}#{m[1]}" =~ /^(:even|:odd)$/
new_nodes = []
nodes.each_with_index {|n,i| new_nodes.push(n) if (i % 2 == (m[1] == "even" ? 0 : 1)) }
nodes = new_nodes
elsif "#{m[0]}#{m[1]}" =~ /^(:first|:last)$/
nodes = [nodes.send(m[1])]
else
meth = "filter[#{m[0]}#{m[1]}]" unless m[0].empty?
if meth and Traverse.method_defined? meth
args = m[2..-1]
else
meth = "filter[#{m[0]}]"
if Traverse.method_defined? meth
args = m[1..-1]
end
end
i = -1
nodes = Elements[*nodes.find_all do |x|
i += 1
x.send(meth, *([*args] + [i])) ? truth : !truth
end]
end
end
[nodes, expr]
end
# Given two elements, attempt to gather an Elements array of everything between
# (and including) those two elements.
def self.expand(ele1, ele2, excl=false)
ary = []
offset = excl ? -1 : 0
if ele1 and ele2
# let's quickly take care of siblings
if ele1.parent == ele2.parent
ary = ele1.parent.children[ele1.node_position..(ele2.node_position+offset)]
else
# find common parent
p, ele1_p = ele1, [ele1]
ele1_p.unshift p while p.respond_to?(:parent) and p = p.parent
p, ele2_p = ele2, [ele2]
ele2_p.unshift p while p.respond_to?(:parent) and p = p.parent
common_parent = ele1_p.zip(ele2_p).select { |p1, p2| p1 == p2 }.flatten.last
child = nil
if ele1 == common_parent
child = ele2
elsif ele2 == common_parent
child = ele1
end
if child
ary = common_parent.children[0..(child.node_position+offset)]
end
end
end
return Elements[*ary]
end
def filter(expr)
nodes, = Elements.filter(self, expr)
nodes
end
def not(expr)
if expr.is_a? Traverse
nodes = self - [expr]
else
nodes, = Elements.filter(self, expr, false)
end
nodes
end
private
def copy_node(node, l)
l.instance_variables.each do |iv|
node.instance_variable_set(iv, l.instance_variable_get(iv))
end
end
end
module Traverse
def self.filter(tok, &blk)
define_method("filter[#{tok.is_a?(String) ? tok : tok.inspect}]", &blk)
end
filter '' do |name,i|
name == '*' || (self.respond_to?(:name) && self.name.downcase == name.downcase)
end
filter '#' do |id,i|
self.elem? and get_attribute('id').to_s == id
end
filter '.' do |name,i|
self.elem? and classes.include? name
end
filter :lt do |num,i|
self.position < num.to_i
end
filter :gt do |num,i|
self.position > num.to_i
end
nth = proc { |num,i| self.position == num.to_i }
nth_first = proc { |*a| self.position == 0 }
nth_last = proc { |*a| self == parent.children_of_type(self.name).last }
filter :nth, &nth
filter :eq, &nth
filter ":nth-of-type", &nth
filter :first, &nth_first
filter ":first-of-type", &nth_first
filter :last, &nth_last
filter ":last-of-type", &nth_last
filter :even do |num,i|
self.position % 2 == 0
end
filter :odd do |num,i|
self.position % 2 == 1
end
filter ':first-child' do |i|
self == parent.containers.first
end
filter ':nth-child' do |arg,i|
case arg
when 'even'; (parent.containers.index(self) + 1) % 2 == 0
when 'odd'; (parent.containers.index(self) + 1) % 2 == 1
# CSS :nth-child counts from 1, while containers is indexed from 0.
else self == (parent.containers[arg.to_i - 1])
end
end
filter ":last-child" do |i|
self == parent.containers.last
end
filter ":nth-last-child" do |arg,i|
self == parent.containers[-1-arg.to_i]
end
filter ":nth-last-of-type" do |arg,i|
self == parent.children_of_type(self.name)[-1-arg.to_i]
end
filter ":only-of-type" do |arg,i|
parent.children_of_type(self.name).length == 1
end
filter ":only-child" do |arg,i|
parent.containers.length == 1
end
filter :parent do
containers.length > 0
end
filter :empty do
containers.length == 0
end
filter :root do
self.is_a? Hpricot::Doc
end
filter 'text' do
self.text?
end
filter 'comment' do
self.comment?
end
filter :contains do |arg, ignore|
html.include? arg
end
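# Define one filter per predicate/operator pair, such as "text()*=" or "@=".
# pred_procs extract the value to test; oper_procs compare it to the argument.
pred_procs =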
{'text()' => proc { |ele, *_| ele.inner_text.strip },
'@' => proc { |ele, attr, *_| ele.get_attribute(attr).to_s if ele.elem? }}
oper_procs =
{'=' => proc { |a,b| a == b },
'!=' => proc { |a,b| a != b },
'~=' => proc { |a,b| a.split(/\s+/).include?(b) },
'|=' => proc { |a,b| a =~ /^#{Regexp::quote b}(-|$)/ },
'^=' => proc { |a,b| a.index(b) == 0 },
'$=' => proc { |a,b| a =~ /#{Regexp::quote b}$/ },
'*=' => proc { |a,b| a.index(b) }}
pred_procs.each do |pred_n, pred_f|
oper_procs.each do |oper_n, oper_f|
filter "#{pred_n}#{oper_n}" do |*a|
qual = pred_f[self, *a]
oper_f[qual, a[-2]] if qual
end
end
end
filter 'text()' do |val,i|
!self.inner_text.strip.empty?
end
filter '@' do |attr,val,i|
self.elem? and has_attribute? attr
end
filter '[' do |val,i|
self.elem? and search(val).length > 0
end
end
end
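
The attribute and text() operators defined above can be tried out on plain strings, without a parsed document. The following is a minimal stand-alone sketch, not part of Hpricot: `attr_operators` and `attr_match?` are hypothetical names introduced here for illustration, and the table simply copies the comparison procs from `oper_procs`.

```ruby
# Hypothetical helper (not in Hpricot): returns a table that mirrors the
# comparison procs in oper_procs. Each proc takes the actual value +a+ and
# the value written in the selector +b+.
def attr_operators
  {
    '='  => proc { |a, b| a == b },                           # exact match
    '!=' => proc { |a, b| a != b },                           # not equal
    '~=' => proc { |a, b| a.split(/\s+/).include?(b) },       # one of the whitespace-separated words
    '|=' => proc { |a, b| a =~ /^#{Regexp.quote(b)}(-|$)/ },  # equal, or prefix followed by a hyphen
    '^=' => proc { |a, b| a.index(b) == 0 },                  # starts with
    '$=' => proc { |a, b| a =~ /#{Regexp.quote(b)}$/ },       # ends with
    '*=' => proc { |a, b| a.index(b) }                        # contains
  }
end

# Hypothetical helper: true when +value+ satisfies operator +op+ against
# +expected+. The procs return truthy or falsy values (some return an index
# or nil), so the result is normalized to a boolean.
def attr_match?(value, op, expected)
  attr_operators.fetch(op).call(value, expected) ? true : false
end

puts attr_match?("nav main", "~=", "main")
puts attr_match?("english", "|=", "en")
```

In a selector these operators appear inside brackets, as in a[text()="Click Me!"] or h3[text()*="space"] from the changelog; the sketch only isolates the string comparison each one performs.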

@ -0,0 +1,672 @@
module Hpricot
# The code below is auto-generated. Don't edit manually.
# :stopdoc:
NamedCharacters =
{"AElig"=>198, "Aacute"=>193, "Acirc"=>194, "Agrave"=>192, "Alpha"=>913,
"Aring"=>197, "Atilde"=>195, "Auml"=>196, "Beta"=>914, "Ccedil"=>199,
"Chi"=>935, "Dagger"=>8225, "Delta"=>916, "ETH"=>208, "Eacute"=>201,
"Ecirc"=>202, "Egrave"=>200, "Epsilon"=>917, "Eta"=>919, "Euml"=>203,
"Gamma"=>915, "Iacute"=>205, "Icirc"=>206, "Igrave"=>204, "Iota"=>921,
"Iuml"=>207, "Kappa"=>922, "Lambda"=>923, "Mu"=>924, "Ntilde"=>209, "Nu"=>925,
"OElig"=>338, "Oacute"=>211, "Ocirc"=>212, "Ograve"=>210, "Omega"=>937,
"Omicron"=>927, "Oslash"=>216, "Otilde"=>213, "Ouml"=>214, "Phi"=>934,
"Pi"=>928, "Prime"=>8243, "Psi"=>936, "Rho"=>929, "Scaron"=>352, "Sigma"=>931,
"THORN"=>222, "Tau"=>932, "Theta"=>920, "Uacute"=>218, "Ucirc"=>219,
"Ugrave"=>217, "Upsilon"=>933, "Uuml"=>220, "Xi"=>926, "Yacute"=>221,
"Yuml"=>376, "Zeta"=>918, "aacute"=>225, "acirc"=>226, "acute"=>180,
"aelig"=>230, "agrave"=>224, "alefsym"=>8501, "alpha"=>945, "amp"=>38,
"and"=>8743, "ang"=>8736, "apos"=>39, "aring"=>229, "asymp"=>8776,
"atilde"=>227, "auml"=>228, "bdquo"=>8222, "beta"=>946, "brvbar"=>166,
"bull"=>8226, "cap"=>8745, "ccedil"=>231, "cedil"=>184, "cent"=>162,
"chi"=>967, "circ"=>710, "clubs"=>9827, "cong"=>8773, "copy"=>169,
"crarr"=>8629, "cup"=>8746, "curren"=>164, "dArr"=>8659, "dagger"=>8224,
"darr"=>8595, "deg"=>176, "delta"=>948, "diams"=>9830, "divide"=>247,
"eacute"=>233, "ecirc"=>234, "egrave"=>232, "empty"=>8709, "emsp"=>8195,
"ensp"=>8194, "epsilon"=>949, "equiv"=>8801, "eta"=>951, "eth"=>240,
"euml"=>235, "euro"=>8364, "exist"=>8707, "fnof"=>402, "forall"=>8704,
"frac12"=>189, "frac14"=>188, "frac34"=>190, "frasl"=>8260, "gamma"=>947,
"ge"=>8805, "gt"=>62, "hArr"=>8660, "harr"=>8596, "hearts"=>9829,
"hellip"=>8230, "iacute"=>237, "icirc"=>238, "iexcl"=>161, "igrave"=>236,
"image"=>8465, "infin"=>8734, "int"=>8747, "iota"=>953, "iquest"=>191,
"isin"=>8712, "iuml"=>239, "kappa"=>954, "lArr"=>8656, "lambda"=>955,
"lang"=>9001, "laquo"=>171, "larr"=>8592, "lceil"=>8968, "ldquo"=>8220,
"le"=>8804, "lfloor"=>8970, "lowast"=>8727, "loz"=>9674, "lrm"=>8206,
"lsaquo"=>8249, "lsquo"=>8216, "lt"=>60, "macr"=>175, "mdash"=>8212,
"micro"=>181, "middot"=>183, "minus"=>8722, "mu"=>956, "nabla"=>8711,
"nbsp"=>160, "ndash"=>8211, "ne"=>8800, "ni"=>8715, "not"=>172, "notin"=>8713,
"nsub"=>8836, "ntilde"=>241, "nu"=>957, "oacute"=>243, "ocirc"=>244,
"oelig"=>339, "ograve"=>242, "oline"=>8254, "omega"=>969, "omicron"=>959,
"oplus"=>8853, "or"=>8744, "ordf"=>170, "ordm"=>186, "oslash"=>248,
"otilde"=>245, "otimes"=>8855, "ouml"=>246, "para"=>182, "part"=>8706,
"permil"=>8240, "perp"=>8869, "phi"=>966, "pi"=>960, "piv"=>982,
"plusmn"=>177, "pound"=>163, "prime"=>8242, "prod"=>8719, "prop"=>8733,
"psi"=>968, "quot"=>34, "rArr"=>8658, "radic"=>8730, "rang"=>9002,
"raquo"=>187, "rarr"=>8594, "rceil"=>8969, "rdquo"=>8221, "real"=>8476,
"reg"=>174, "rfloor"=>8971, "rho"=>961, "rlm"=>8207, "rsaquo"=>8250,
"rsquo"=>8217, "sbquo"=>8218, "scaron"=>353, "sdot"=>8901, "sect"=>167,
"shy"=>173, "sigma"=>963, "sigmaf"=>962, "sim"=>8764, "spades"=>9824,
"sub"=>8834, "sube"=>8838, "sum"=>8721, "sup"=>8835, "sup1"=>185, "sup2"=>178,
"sup3"=>179, "supe"=>8839, "szlig"=>223, "tau"=>964, "there4"=>8756,
"theta"=>952, "thetasym"=>977, "thinsp"=>8201, "thorn"=>254, "tilde"=>732,
"times"=>215, "trade"=>8482, "uArr"=>8657, "uacute"=>250, "uarr"=>8593,
"ucirc"=>251, "ugrave"=>249, "uml"=>168, "upsih"=>978, "upsilon"=>965,
"uuml"=>252, "weierp"=>8472, "xi"=>958, "yacute"=>253, "yen"=>165,
"yuml"=>255, "zeta"=>950, "zwj"=>8205, "zwnj"=>8204}
NamedCharactersPattern = /\A(?-mix:AElig|Aacute|Acirc|Agrave|Alpha|Aring|Atilde|Auml|Beta|Ccedil|Chi|Dagger|Delta|ETH|Eacute|Ecirc|Egrave|Epsilon|Eta|Euml|Gamma|Iacute|Icirc|Igrave|Iota|Iuml|Kappa|Lambda|Mu|Ntilde|Nu|OElig|Oacute|Ocirc|Ograve|Omega|Omicron|Oslash|Otilde|Ouml|Phi|Pi|Prime|Psi|Rho|Scaron|Sigma|THORN|Tau|Theta|Uacute|Ucirc|Ugrave|Upsilon|Uuml|Xi|Yacute|Yuml|Zeta|aacute|acirc|acute|aelig|agrave|alefsym|alpha|amp|and|ang|apos|aring|asymp|atilde|auml|bdquo|beta|brvbar|bull|cap|ccedil|cedil|cent|chi|circ|clubs|cong|copy|crarr|cup|curren|dArr|dagger|darr|deg|delta|diams|divide|eacute|ecirc|egrave|empty|emsp|ensp|epsilon|equiv|eta|eth|euml|euro|exist|fnof|forall|frac12|frac14|frac34|frasl|gamma|ge|gt|hArr|harr|hearts|hellip|iacute|icirc|iexcl|igrave|image|infin|int|iota|iquest|isin|iuml|kappa|lArr|lambda|lang|laquo|larr|lceil|ldquo|le|lfloor|lowast|loz|lrm|lsaquo|lsquo|lt|macr|mdash|micro|middot|minus|mu|nabla|nbsp|ndash|ne|ni|not|notin|nsub|ntilde|nu|oacute|ocirc|oelig|ograve|oline|omega|omicron|oplus|or|ordf|ordm|oslash|otilde|otimes|ouml|para|part|permil|perp|phi|pi|piv|plusmn|pound|prime|prod|prop|psi|quot|rArr|radic|rang|raquo|rarr|rceil|rdquo|real|reg|rfloor|rho|rlm|rsaquo|rsquo|sbquo|scaron|sdot|sect|shy|sigma|sigmaf|sim|spades|sub|sube|sum|sup|sup1|sup2|sup3|supe|szlig|tau|there4|theta|thetasym|thinsp|thorn|tilde|times|trade|uArr|uacute|uarr|ucirc|ugrave|uml|upsih|upsilon|uuml|weierp|xi|yacute|yen|yuml|zeta|zwj|zwnj)\z/
ElementContent =
{"h6"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"object"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "param", "pre", "q",
"s", "samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"dl"=>["dd", "dt"],
"p"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"acronym"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"code"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"ul"=>["li"],
"tt"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"label"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"form"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"q"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"thead"=>["tr"],
"area"=>:EMPTY,
"td"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"title"=>[],
"dir"=>["li"],
"s"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"ol"=>["li"],
"hr"=>:EMPTY,
"applet"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "param", "pre", "q",
"s", "samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"table"=>["caption", "col", "colgroup", "tbody", "tfoot", "thead", "tr"],
"legend"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"cite"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"a"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"html"=>
["a", "abbr", "acronym", "address", "applet", "b", "base", "basefont", "bdo",
"big", "blockquote", "body", "br", "button", "center", "cite", "code",
"dfn", "dir", "div", "dl", "em", "fieldset", "font", "form", "h1", "h2",
"h3", "h4", "h5", "h6", "head", "hr", "i", "iframe", "img", "input",
"isindex", "kbd", "label", "map", "menu", "noframes", "noscript", "object",
"ol", "p", "pre", "q", "s", "samp", "script", "select", "small", "span",
"strike", "strong", "sub", "sup", "table", "textarea", "title", "tt", "u",
"ul", "var"],
"u"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"blockquote"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"center"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"b"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"base"=>:EMPTY,
"th"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"link"=>:EMPTY,
"var"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"samp"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"div"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"textarea"=>[],
"pre"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"head"=>["base", "isindex", "title"],
"span"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"br"=>:EMPTY,
"script"=>:CDATA,
"noframes"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"style"=>:CDATA,
"meta"=>:EMPTY,
"dt"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"option"=>[],
"kbd"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"big"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"tfoot"=>["tr"],
"sup"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"bdo"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"isindex"=>:EMPTY,
"dfn"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"fieldset"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "legend",
"map", "menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"em"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"font"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"tbody"=>["tr"],
"noscript"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"li"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"col"=>:EMPTY,
"small"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"dd"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"i"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"menu"=>["li"],
"strong"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"basefont"=>:EMPTY,
"img"=>:EMPTY,
"optgroup"=>["option"],
"map"=>
["address", "area", "blockquote", "center", "dir", "div", "dl", "fieldset",
"form", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "isindex", "menu",
"noframes", "noscript", "ol", "p", "pre", "table", "ul"],
"h1"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"address"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "p", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"sub"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"param"=>:EMPTY,
"input"=>:EMPTY,
"h2"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"abbr"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"h3"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"strike"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"body"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"ins"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"button"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"h4"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"select"=>["optgroup", "option"],
"caption"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"colgroup"=>["col"],
"tr"=>["td", "th"],
"del"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"],
"h5"=>
["a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br",
"button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img",
"input", "kbd", "label", "map", "object", "q", "s", "samp", "script",
"select", "small", "span", "strike", "strong", "sub", "sup", "textarea",
"tt", "u", "var"],
"iframe"=>
["a", "abbr", "acronym", "address", "applet", "b", "basefont", "bdo", "big",
"blockquote", "br", "button", "center", "cite", "code", "dfn", "dir", "div",
"dl", "em", "fieldset", "font", "form", "h1", "h2", "h3", "h4", "h5", "h6",
"hr", "i", "iframe", "img", "input", "isindex", "kbd", "label", "map",
"menu", "noframes", "noscript", "object", "ol", "p", "pre", "q", "s",
"samp", "script", "select", "small", "span", "strike", "strong", "sub",
"sup", "table", "textarea", "tt", "u", "ul", "var"]}
ElementInclusions =
{"head"=>["link", "meta", "object", "script", "style"], "body"=>["del", "ins"]}
ElementExclusions =
{"button"=>
["a", "button", "fieldset", "form", "iframe", "input", "isindex", "label",
"select", "textarea"],
"a"=>["a"],
"dir"=>
["address", "blockquote", "center", "dir", "div", "dl", "fieldset", "form",
"h1", "h2", "h3", "h4", "h5", "h6", "hr", "isindex", "menu", "noframes",
"noscript", "ol", "p", "pre", "table", "ul"],
"title"=>["link", "meta", "object", "script", "style"],
"pre"=>
["applet", "basefont", "big", "font", "img", "object", "small", "sub",
"sup"],
"form"=>["form"],
"menu"=>
["address", "blockquote", "center", "dir", "div", "dl", "fieldset", "form",
"h1", "h2", "h3", "h4", "h5", "h6", "hr", "isindex", "menu", "noframes",
"noscript", "ol", "p", "pre", "table", "ul"],
"label"=>["label"]}
OmittedAttrName =
{"h6"=>
{"center"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"right"=>"align", "rtl"=>"dir"},
"object"=>
{"bottom"=>"align", "declare"=>"declare", "left"=>"align", "ltr"=>"dir",
"middle"=>"align", "right"=>"align", "rtl"=>"dir", "top"=>"align"},
"dl"=>{"compact"=>"compact", "ltr"=>"dir", "rtl"=>"dir"},
"p"=>
{"center"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"right"=>"align", "rtl"=>"dir"},
"acronym"=>{"ltr"=>"dir", "rtl"=>"dir"},
"code"=>{"ltr"=>"dir", "rtl"=>"dir"},
"ul"=>
{"circle"=>"type", "compact"=>"compact", "disc"=>"type", "ltr"=>"dir",
"rtl"=>"dir", "square"=>"type"},
"tt"=>{"ltr"=>"dir", "rtl"=>"dir"},
"label"=>{"ltr"=>"dir", "rtl"=>"dir"},
"form"=>{"get"=>"method", "ltr"=>"dir", "post"=>"method", "rtl"=>"dir"},
"q"=>{"ltr"=>"dir", "rtl"=>"dir"},
"thead"=>
{"baseline"=>"valign", "bottom"=>"valign", "center"=>"align",
"char"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"middle"=>"valign", "right"=>"align", "rtl"=>"dir", "top"=>"valign"},
"area"=>
{"circle"=>"shape", "default"=>"shape", "ltr"=>"dir", "nohref"=>"nohref",
"poly"=>"shape", "rect"=>"shape", "rtl"=>"dir"},
"td"=>
{"baseline"=>"valign", "bottom"=>"valign", "center"=>"align",
"char"=>"align", "col"=>"scope", "colgroup"=>"scope", "justify"=>"align",
"left"=>"align", "ltr"=>"dir", "middle"=>"valign", "nowrap"=>"nowrap",
"right"=>"align", "row"=>"scope", "rowgroup"=>"scope", "rtl"=>"dir",
"top"=>"valign"},
"title"=>{"ltr"=>"dir", "rtl"=>"dir"},
"dir"=>{"compact"=>"compact", "ltr"=>"dir", "rtl"=>"dir"},
"s"=>{"ltr"=>"dir", "rtl"=>"dir"},
"ol"=>{"compact"=>"compact", "ltr"=>"dir", "rtl"=>"dir"},
"hr"=>
{"center"=>"align", "left"=>"align", "ltr"=>"dir", "noshade"=>"noshade",
"right"=>"align", "rtl"=>"dir"},
"applet"=>
{"bottom"=>"align", "left"=>"align", "middle"=>"align", "right"=>"align",
"top"=>"align"},
"table"=>
{"above"=>"frame", "all"=>"rules", "below"=>"frame", "border"=>"frame",
"box"=>"frame", "center"=>"align", "cols"=>"rules", "groups"=>"rules",
"hsides"=>"frame", "left"=>"align", "lhs"=>"frame", "ltr"=>"dir",
"none"=>"rules", "rhs"=>"frame", "right"=>"align", "rows"=>"rules",
"rtl"=>"dir", "void"=>"frame", "vsides"=>"frame"},
"legend"=>
{"bottom"=>"align", "left"=>"align", "ltr"=>"dir", "right"=>"align",
"rtl"=>"dir", "top"=>"align"},
"cite"=>{"ltr"=>"dir", "rtl"=>"dir"},
"a"=>
{"circle"=>"shape", "default"=>"shape", "ltr"=>"dir", "poly"=>"shape",
"rect"=>"shape", "rtl"=>"dir"},
"html"=>{"ltr"=>"dir", "rtl"=>"dir"},
"u"=>{"ltr"=>"dir", "rtl"=>"dir"},
"blockquote"=>{"ltr"=>"dir", "rtl"=>"dir"},
"center"=>{"ltr"=>"dir", "rtl"=>"dir"},
"b"=>{"ltr"=>"dir", "rtl"=>"dir"},
"th"=>
{"baseline"=>"valign", "bottom"=>"valign", "center"=>"align",
"char"=>"align", "col"=>"scope", "colgroup"=>"scope", "justify"=>"align",
"left"=>"align", "ltr"=>"dir", "middle"=>"valign", "nowrap"=>"nowrap",
"right"=>"align", "row"=>"scope", "rowgroup"=>"scope", "rtl"=>"dir",
"top"=>"valign"},
"link"=>{"ltr"=>"dir", "rtl"=>"dir"},
"var"=>{"ltr"=>"dir", "rtl"=>"dir"},
"samp"=>{"ltr"=>"dir", "rtl"=>"dir"},
"div"=>
{"center"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"right"=>"align", "rtl"=>"dir"},
"textarea"=>
{"disabled"=>"disabled", "ltr"=>"dir", "readonly"=>"readonly", "rtl"=>"dir"},
"pre"=>{"ltr"=>"dir", "rtl"=>"dir"},
"head"=>{"ltr"=>"dir", "rtl"=>"dir"},
"span"=>{"ltr"=>"dir", "rtl"=>"dir"},
"br"=>{"all"=>"clear", "left"=>"clear", "none"=>"clear", "right"=>"clear"},
"script"=>{"defer"=>"defer"},
"noframes"=>{"ltr"=>"dir", "rtl"=>"dir"},
"style"=>{"ltr"=>"dir", "rtl"=>"dir"},
"meta"=>{"ltr"=>"dir", "rtl"=>"dir"},
"dt"=>{"ltr"=>"dir", "rtl"=>"dir"},
"option"=>
{"disabled"=>"disabled", "ltr"=>"dir", "rtl"=>"dir", "selected"=>"selected"},
"kbd"=>{"ltr"=>"dir", "rtl"=>"dir"},
"big"=>{"ltr"=>"dir", "rtl"=>"dir"},
"tfoot"=>
{"baseline"=>"valign", "bottom"=>"valign", "center"=>"align",
"char"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"middle"=>"valign", "right"=>"align", "rtl"=>"dir", "top"=>"valign"},
"sup"=>{"ltr"=>"dir", "rtl"=>"dir"},
"bdo"=>{"ltr"=>"dir", "rtl"=>"dir"},
"isindex"=>{"ltr"=>"dir", "rtl"=>"dir"},
"dfn"=>{"ltr"=>"dir", "rtl"=>"dir"},
"fieldset"=>{"ltr"=>"dir", "rtl"=>"dir"},
"em"=>{"ltr"=>"dir", "rtl"=>"dir"},
"font"=>{"ltr"=>"dir", "rtl"=>"dir"},
"tbody"=>
{"baseline"=>"valign", "bottom"=>"valign", "center"=>"align",
"char"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"middle"=>"valign", "right"=>"align", "rtl"=>"dir", "top"=>"valign"},
"noscript"=>{"ltr"=>"dir", "rtl"=>"dir"},
"li"=>{"ltr"=>"dir", "rtl"=>"dir"},
"col"=>
{"baseline"=>"valign", "bottom"=>"valign", "center"=>"align",
"char"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"middle"=>"valign", "right"=>"align", "rtl"=>"dir", "top"=>"valign"},
"small"=>{"ltr"=>"dir", "rtl"=>"dir"},
"dd"=>{"ltr"=>"dir", "rtl"=>"dir"},
"i"=>{"ltr"=>"dir", "rtl"=>"dir"},
"menu"=>{"compact"=>"compact", "ltr"=>"dir", "rtl"=>"dir"},
"strong"=>{"ltr"=>"dir", "rtl"=>"dir"},
"img"=>
{"bottom"=>"align", "ismap"=>"ismap", "left"=>"align", "ltr"=>"dir",
"middle"=>"align", "right"=>"align", "rtl"=>"dir", "top"=>"align"},
"optgroup"=>{"disabled"=>"disabled", "ltr"=>"dir", "rtl"=>"dir"},
"map"=>{"ltr"=>"dir", "rtl"=>"dir"},
"address"=>{"ltr"=>"dir", "rtl"=>"dir"},
"h1"=>
{"center"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"right"=>"align", "rtl"=>"dir"},
"sub"=>{"ltr"=>"dir", "rtl"=>"dir"},
"param"=>{"data"=>"valuetype", "object"=>"valuetype", "ref"=>"valuetype"},
"input"=>
{"bottom"=>"align", "button"=>"type", "checkbox"=>"type",
"checked"=>"checked", "disabled"=>"disabled", "file"=>"type",
"hidden"=>"type", "image"=>"type", "ismap"=>"ismap", "left"=>"align",
"ltr"=>"dir", "middle"=>"align", "password"=>"type", "radio"=>"type",
"readonly"=>"readonly", "reset"=>"type", "right"=>"align", "rtl"=>"dir",
"submit"=>"type", "text"=>"type", "top"=>"align"},
"h2"=>
{"center"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"right"=>"align", "rtl"=>"dir"},
"abbr"=>{"ltr"=>"dir", "rtl"=>"dir"},
"h3"=>
{"center"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"right"=>"align", "rtl"=>"dir"},
"strike"=>{"ltr"=>"dir", "rtl"=>"dir"},
"body"=>{"ltr"=>"dir", "rtl"=>"dir"},
"ins"=>{"ltr"=>"dir", "rtl"=>"dir"},
"button"=>
{"button"=>"type", "disabled"=>"disabled", "ltr"=>"dir", "reset"=>"type",
"rtl"=>"dir", "submit"=>"type"},
"h4"=>
{"center"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"right"=>"align", "rtl"=>"dir"},
"select"=>
{"disabled"=>"disabled", "ltr"=>"dir", "multiple"=>"multiple", "rtl"=>"dir"},
"caption"=>
{"bottom"=>"align", "left"=>"align", "ltr"=>"dir", "right"=>"align",
"rtl"=>"dir", "top"=>"align"},
"colgroup"=>
{"baseline"=>"valign", "bottom"=>"valign", "center"=>"align",
"char"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"middle"=>"valign", "right"=>"align", "rtl"=>"dir", "top"=>"valign"},
"tr"=>
{"baseline"=>"valign", "bottom"=>"valign", "center"=>"align",
"char"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"middle"=>"valign", "right"=>"align", "rtl"=>"dir", "top"=>"valign"},
"del"=>{"ltr"=>"dir", "rtl"=>"dir"},
"h5"=>
{"center"=>"align", "justify"=>"align", "left"=>"align", "ltr"=>"dir",
"right"=>"align", "rtl"=>"dir"},
"iframe"=>
{"0"=>"frameborder", "1"=>"frameborder", "auto"=>"scrolling",
"bottom"=>"align", "left"=>"align", "middle"=>"align", "no"=>"scrolling",
"right"=>"align", "top"=>"align", "yes"=>"scrolling"}}
# :startdoc:
# The code above is auto-generated. Don't edit manually.
end
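
Note on OmittedAttrName: the table appears to map, per element, a bare attribute value to the attribute name it implies (SGML-style minimized attributes such as `<hr noshade>`). The following is a minimal sketch of how such a table could be consulted. It assumes a three-element excerpt of the table and a hypothetical helper, `expand_minimized`, that is not part of Hpricot.

```ruby
# Excerpt of the OmittedAttrName table; only three elements are copied here.
OMITTED_ATTR_NAME = {
  "hr"     => {"center"=>"align", "left"=>"align", "ltr"=>"dir",
               "noshade"=>"noshade", "right"=>"align", "rtl"=>"dir"},
  "br"     => {"all"=>"clear", "left"=>"clear", "none"=>"clear", "right"=>"clear"},
  "script" => {"defer"=>"defer"}
}.freeze

# Hypothetical helper (an assumption for illustration): given a tag name and a
# bare token found inside the tag, return a [name, value] pair, or nil when the
# token does not imply any attribute for that element.
def expand_minimized(tag, token)
  name = OMITTED_ATTR_NAME.fetch(tag.downcase, {})[token.downcase]
  name && [name, token.downcase]
end

p expand_minimized("hr", "noshade")
p expand_minimized("BR", "all")
p expand_minimized("script", "async")
```

Unknown elements and unknown tokens both fall through to nil, so a caller can treat the token as an ordinary valueless attribute.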

@ -0,0 +1,107 @@
require 'pp'
module Hpricot
# :stopdoc:
class Elements
def pretty_print(q)
q.object_group(self) { super }
end
alias inspect pretty_print_inspect
end
class Doc
def pretty_print(q)
q.object_group(self) { @children.each {|elt| q.breakable; q.pp elt } }
end
alias inspect pretty_print_inspect
end
class Elem
def pretty_print(q)
if empty?
q.group(1, '{emptyelem', '}') {
q.breakable; q.pp @stag
}
else
q.group(1, "{elem", "}") {
q.breakable; q.pp @stag
if @children
@children.each {|elt| q.breakable; q.pp elt }
end
if @etag
q.breakable; q.pp @etag
end
}
end
end
alias inspect pretty_print_inspect
end
module Leaf
def pretty_print(q)
q.group(1, '{', '}') {
q.text self.class.name.sub(/.*::/,'').downcase
if rs = @raw_string
rs.scan(/[^\r\n]*(?:\r\n?|\n|[^\r\n]\z)/) {|line|
q.breakable
q.pp line
}
elsif self.respond_to? :to_s
q.breakable
q.text self.to_s
end
}
end
alias inspect pretty_print_inspect
end
class STag
def pretty_print(q)
q.group(1, '<', '>') {
q.text @name
if @raw_attributes
@raw_attributes.each {|n, t|
q.breakable
if t
q.text "#{n}=\"#{Hpricot.uxs(t)}\""
else
q.text n
end
}
end
}
end
alias inspect pretty_print_inspect
end
class ETag
def pretty_print(q)
q.group(1, '</', '>') {
q.text @name
}
end
alias inspect pretty_print_inspect
end
class Text
def pretty_print(q)
q.text @content.dump
end
end
class BogusETag
def pretty_print(q)
q.group(1, '{', '}') {
q.text self.class.name.sub(/.*::/,'').downcase
if rs = @raw_string
q.breakable
q.text rs
else
q.text "</#{@name}>"
end
}
end
end
# :startdoc:
end
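
Note on the pretty_print hooks: each class above implements `pretty_print(q)` from the standard `pp` library and aliases `inspect` to `pretty_print_inspect`. A minimal sketch of that protocol follows, using a hypothetical stand-in class (`TinyTag`) rather than Hpricot's own STag.

```ruby
require 'pp'

# Hypothetical stand-in for a start tag; not Hpricot's STag class.
class TinyTag
  def initialize(name, attrs = {})
    @name, @attrs = name, attrs
  end

  # PP calls this with a printer object; group/text/breakable describe the layout.
  def pretty_print(q)
    q.group(1, '<', '>') do
      q.text @name
      @attrs.each do |k, v|
        q.breakable                 # a space, or a line break when the line is full
        q.text "#{k}=\"#{v}\""
      end
    end
  end

  # pretty_print_inspect renders the pretty_print layout on a single line.
  alias inspect pretty_print_inspect
end

puts TinyTag.new("a", {"href" => "index.html"}).inspect
```

With the alias in place, `p` and `pp` share one description of the object: `p` prints it on one line, while `pp` may wrap at the breakable points.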

@ -0,0 +1,37 @@
module Hpricot
class Name; include Hpricot end
class Context; include Hpricot end
# :stopdoc:
module Tag; include Hpricot end
class STag; include Tag end
class ETag; include Tag end
# :startdoc:
module Node; include Hpricot end
module Container; include Node end
class Doc; include Container end
class Elem; include Container end
module Leaf; include Node end
class Text; include Leaf end
class XMLDecl; include Leaf end
class DocType; include Leaf end
class ProcIns; include Leaf end
class Comment; include Leaf end
class BogusETag; include Leaf end
module Traverse end
module Container::Trav; include Traverse end
module Leaf::Trav; include Traverse end
class Doc; module Trav; include Container::Trav end; include Trav end
class Elem; module Trav; include Container::Trav end; include Trav end
class Text; module Trav; include Leaf::Trav end; include Trav end
class XMLDecl; module Trav; include Leaf::Trav end; include Trav end
class DocType; module Trav; include Leaf::Trav end; include Trav end
class ProcIns; module Trav; include Leaf::Trav end; include Trav end
class Comment; module Trav; include Leaf::Trav end; include Trav end
class BogusETag; module Trav; include Leaf::Trav end; include Trav end
class Error < StandardError; end
end
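
Note on the mixin hierarchy: the file above wires each node class to its own `Trav` module, which is what lets predicates such as `elem?` be written as `Elem::Trav === self`. The sketch below reproduces that arrangement in a hypothetical `Sketch` namespace with only two node classes; it is an illustration, not Hpricot's code.

```ruby
module Sketch
  module Traverse
    # Module#=== is true when the object's class includes the module.
    def elem?; Elem::Trav === self end
    def text?; Text::Trav === self end
  end

  module Container; end
  module Leaf; end
  module Container::Trav; include Traverse end
  module Leaf::Trav; include Traverse end

  # Each class gets a private Trav module layered on the shared one.
  class Elem; include Container; module Trav; include Container::Trav end; include Trav end
  class Text; include Leaf;      module Trav; include Leaf::Trav end;      include Trav end
end

e, t = Sketch::Elem.new, Sketch::Text.new
p [e.elem?, e.text?]
p [t.elem?, t.text?]
```

Because every class has a distinct `Trav` module, the type test needs no class comparison, and shared traversal methods still live in one place (`Traverse`).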

@ -0,0 +1,297 @@
require 'hpricot/htmlinfo'
def Hpricot(input = nil, opts = {}, &blk)
Hpricot.parse(input, opts, &blk)
end
module Hpricot
# Exception class used for any errors related to deficiencies in the system when
# handling the character encodings of a document.
class EncodingError < StandardError; end
# Hpricot.parse parses <i>input</i> and returns a document tree
# represented by Hpricot::Doc.
def Hpricot.parse(input = nil, opts = {}, &blk)
Doc.new(make(input, opts, &blk))
end
# Hpricot::XML parses <i>input</i>, disregarding all the HTML rules
# and returning a document tree.
def Hpricot.XML(input, opts = {})
Doc.new(make(input, opts.merge(:xml => true)))
end
# :stopdoc:
def Hpricot.make(input = nil, opts = {}, &blk)
opts = {:fixup_tags => false}.merge(opts)
unless input or blk
raise ArgumentError, "An Hpricot document must be built from an input source (a String) or a block."
end
conv = opts[:xml] ? :to_s : :downcase
fragment =
if input
case opts[:encoding]
when nil
when 'utf-8'
unless defined? Encoding::Character::UTF8
raise EncodingError, "The ruby-character-encodings library could not be found for utf-8 mode."
end
else
raise EncodingError, "No encoding option `#{opts[:encoding]}' is available."
end
if opts[:xhtml_strict]
opts[:fixup_tags] = true
end
stack = [[nil, nil, [], [], [], []]]
Hpricot.scan(input) do |token|
if stack.last[5] == :CDATA and ![:procins, :comment, :cdata].include?(token[0]) and
!(token[0] == :etag and token[1].casecmp(stack.last[0]).zero?)
token[0] = :text
token[1] = token[3] if token[3]
end
if !opts[:xml] and token[0] == :emptytag
token[1] = token[1].send(conv)
if ElementContent[token[1].downcase] != :EMPTY
token[0] = :stag
end
end
# TODO: downcase instead when parsing attributes?
if !opts[:xml] and token[2].is_a?(Hash)
token[2] = token[2].inject({}) { |hsh,(k,v)| hsh[k.downcase] = v; hsh }
end
case token[0]
when :stag
case opts[:encoding] when 'utf-8'
token.map! { |str| u(str) if str.is_a? String }
end
stagname = token[0] = token[1] = token[1].send(conv)
if ElementContent[stagname] == :EMPTY and !opts[:xml]
token[0] = :emptytag
stack.last[2] << token
else
unless opts[:xml]
if opts[:fixup_tags]
# obey the tag rules set up by the current element
if ElementContent.has_key? stagname
trans = nil
(stack.length-1).downto(0) do |i|
untags = stack[i][5]
break unless untags.include? stagname
# puts "** ILLEGAL #{stagname} IN #{stack[i][0]}"
trans = i
end
if trans.to_i > 1
eles = stack.slice!(trans..-1)
stack.last[2] += eles
# puts "** TRANSPLANTED #{stagname} TO #{stack.last[0]}"
end
elsif opts[:xhtml_strict]
token[2] = {'class' => stagname}
stagname = token[0] = "div"
end
end
# setup tag rules for inside this element
if ElementContent[stagname] == :CDATA
uncontainable_tags = :CDATA
elsif opts[:fixup_tags]
possible_tags = ElementContent[stagname]
excluded_tags, included_tags = stack.last[3..4]
if possible_tags
excluded_tags = excluded_tags | (ElementExclusions[stagname] || [])
included_tags = included_tags | (ElementInclusions[stagname] || [])
containable_tags = (possible_tags | included_tags) - excluded_tags
uncontainable_tags = ElementContent.keys - containable_tags
else
# If the tagname is unknown, it is assumed that any element
# except excluded can be contained.
uncontainable_tags = excluded_tags
end
end
end
unless opts[:xml]
case token[2] when Hash
token[2] = token[2].inject({}) { |hsh,(k,v)| hsh[k.downcase] = v; hsh }
end
end
stack << [stagname, token, [], excluded_tags, included_tags, uncontainable_tags]
end
when :etag
etagname = token[0] = token[1].send(conv)
if opts[:xhtml_strict] and not ElementContent.has_key? etagname
etagname = token[0] = "div"
end
matched_elem = nil
(stack.length-1).downto(0) do |i|
stagname, = stack[i]
if stagname == etagname
matched_elem = stack[i]
stack[i][1] += token
eles = stack.slice!((i+1)..-1)
stack.last[2] += eles
break
end
end
unless matched_elem
stack.last[2] << [:bogus_etag, token.first, token.last]
else
ele = stack.pop
stack.last[2] << ele
end
when :text
l = stack.last[2].last
if l and l[0] == :text
l[1] += token[1]
else
stack.last[2] << token
end
else
stack.last[2] << token
end
end
while 1 < stack.length
ele = stack.pop
stack.last[2] << ele
end
structure_list = stack[0][2]
structure_list.map {|s| build_node(s, opts) }
elsif blk
Hpricot.build(&blk).children
end
end
def Hpricot.build_node(structure, opts = {})
case structure[0]
when String
tagname, _, attrs, sraw, _, _, _, eraw = structure[1]
children = structure[2]
etag = eraw && ETag.parse(tagname, eraw)
stag = STag.parse(tagname, attrs, sraw, true)
if !children.empty? || etag
Elem.new(stag,
children.map {|c| build_node(c, opts) },
etag)
else
Elem.new(stag)
end
when :text
Text.parse_pcdata(structure[1])
when :emptytag
Elem.new(STag.parse(structure[1], structure[2], structure[3], false))
when :bogus_etag
BogusETag.parse(structure[1], structure[2])
when :xmldecl
XMLDecl.parse(structure[2], structure[3])
when :doctype
if opts[:xhtml_strict]
structure[2]['system_id'] = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
structure[2]['public_id'] = "-//W3C//DTD XHTML 1.0 Strict//EN"
end
DocType.parse(structure[1], structure[2], structure[3])
when :procins
ProcIns.parse(structure[1])
when :comment
Comment.parse(structure[1])
when :cdata_content
Text.parse_cdata_content(structure[1])
when :cdata
Text.parse_cdata_section(structure[1])
else
raise Exception, "[bug] unknown structure: #{structure.inspect}"
end
end
def STag.parse(qname, attrs, raw_string, is_stag)
result = STag.new(qname, attrs)
result.raw_string = raw_string
result
end
def ETag.parse(qname, raw_string)
result = self.new(qname)
result.raw_string = raw_string
result
end
def BogusETag.parse(qname, raw_string)
result = self.new(qname)
result.raw_string = raw_string
result
end
def Text.parse_pcdata(raw_string)
result = Text.new(raw_string)
result
end
def Text.parse_cdata_content(raw_string)
result = CData.new(raw_string)
result
end
def Text.parse_cdata_section(content)
result = CData.new(content)
result
end
def XMLDecl.parse(attrs, raw_string)
attrs ||= {}
version = attrs['version']
encoding = attrs['encoding']
case attrs['standalone']
when 'yes'
standalone = true
when 'no'
standalone = false
else
standalone = nil
end
result = XMLDecl.new(version, encoding, standalone)
result.raw_string = raw_string
result
end
def DocType.parse(root_element_name, attrs, raw_string)
if attrs
public_identifier = attrs['public_id']
system_identifier = attrs['system_id']
end
root_element_name = root_element_name.downcase
result = DocType.new(root_element_name, public_identifier, system_identifier)
result.raw_string = raw_string
result
end
def ProcIns.parse(raw_string)
_, target, content = *raw_string.match(/\A<\?(\S+)\s+(.+)/m)
result = ProcIns.new(target, content)
result
end
def Comment.parse(content)
result = Comment.new(content)
result
end
module Pat
NameChar = /[-A-Za-z0-9._:]/
Name = /[A-Za-z_:]#{NameChar}*/
Nmtoken = /#{NameChar}+/
end
# :startdoc:
end
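
Note on Hpricot.make: the parser keeps a stack of open elements, folds unclosed elements into the element being closed, merges adjacent text tokens, and records unmatched end tags as bogus. A much simplified sketch of that idea follows. The token shapes and the `build_tree` name are assumptions for illustration; the real scanner emits richer tokens and also applies the HTML containment rules.

```ruby
# Hypothetical, simplified token stream: [:stag, name], [:etag, name], [:text, str].
# Element frames are [name, children]; leaves are [:text, str] or [:bogus_etag, name].
def build_tree(tokens)
  stack = [[nil, []]]                        # root frame
  tokens.each do |type, value|
    case type
    when :stag
      stack << [value, []]                   # open a new frame
    when :etag
      i = stack.rindex { |name, _| name == value }
      if i && i > 0
        # Frames left open above the match become children of the match.
        unclosed = stack.slice!((i + 1)..-1)
        stack.last[1].concat(unclosed)
        frame = stack.pop
        stack.last[1] << frame
      else
        stack.last[1] << [:bogus_etag, value]  # stray end tag
      end
    when :text
      last = stack.last[1].last
      if last && last[0] == :text
        last[1] += value                     # merge adjacent text
      else
        stack.last[1] << [:text, value]
      end
    end
  end
  # Close whatever is still open at the end of the input.
  while stack.length > 1
    frame = stack.pop
    stack.last[1] << frame
  end
  stack[0][1]
end

tokens = [[:stag, "p"], [:text, "Hello, "], [:text, "world"],
          [:stag, "b"], [:text, "!"], [:etag, "p"], [:etag, "i"]]
p build_tree(tokens)
```

In this input `<b>` is never closed, so it is attached to `<p>` when `</p>` arrives, and `</i>` has no opener, so it is kept as a bogus end tag instead of being dropped.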

@ -0,0 +1,228 @@
module Hpricot
# :stopdoc:
class Doc
attr_accessor :children
def initialize(children = [])
@children = children ? children.each { |c| c.parent = self } : []
end
def output(out, opts = {})
@children.each do |n|
n.output(out, opts)
end
out
end
def altered!; end
end
class BaseEle
attr_accessor :raw_string, :parent
def html_quote(str)
"\"" + str.gsub('"', '\\"') + "\""
end
def if_output(opts)
if opts[:preserve] and not @raw_string.nil?
@raw_string
else
yield opts
end
end
def pathname; self.name end
def altered!
@raw_string = nil
end
def self.alterable(*fields)
attr_accessor(*fields)
fields.each do |f|
define_method("#{f}=") do |v|
altered!
instance_variable_set("@#{f}", v)
end
end
end
end
class Elem
attr_accessor :stag, :etag, :children
def initialize(stag, children=nil, etag=nil)
@stag, @etag = stag, etag
@children = children ? children.each { |c| c.parent = self } : []
end
def empty?; @children.empty? end
[:name, :raw_attributes, :parent, :altered!].each do |m|
[m, "#{m}="].each { |m2| define_method(m2) { |*a| [@etag, @stag].inject { |_,t| t.send(m2, *a) if t and t.respond_to?(m2) } } }
end
def attributes
if raw_attributes
raw_attributes.inject({}) do |hsh, (k, v)|
hsh[k] = Hpricot.uxs(v)
hsh
end
end
end
def to_plain_text
if self.name == 'br'
"\n"
elsif self.name == 'p'
"\n\n" + super + "\n\n"
elsif self.name == 'a' and self.has_attribute?('href')
"#{super} [#{self['href']}]"
elsif self.name == 'img' and self.has_attribute?('src')
"[img:#{self['src']}]"
else
super
end
end
def pathname; self.name end
def output(out, opts = {})
if empty? and ElementContent[@stag.name] == :EMPTY
@stag.output(out, opts.merge(:style => :empty))
else
@stag.output(out, opts)
@children.each { |n| n.output(out, opts) }
if @etag
@etag.output(out, opts)
elsif !opts[:preserve]
ETag.new(@stag.name).output(out, opts)
end
end
out
end
end
class STag < BaseEle
def initialize(name, attributes=nil)
@name = name.to_s
@raw_attributes = attributes || {}
end
alterable :name, :raw_attributes
def attributes_as_html
if @raw_attributes
@raw_attributes.map do |aname, aval|
" #{aname}" +
(aval ? "=\"#{aval}\"" : "")
end.join
end
end
def output(out, opts = {})
out <<
if_output(opts) do
"<#{@name}#{attributes_as_html}" +
(opts[:style] == :empty ? " /" : "") +
">"
end
end
end
class ETag < BaseEle
def initialize(qualified_name)
@name = qualified_name.to_s
end
alterable :name
def output(out, opts = {})
out <<
if_output(opts) do
"</#{@name}>"
end
end
end
class BogusETag < ETag
def output(out, opts = {}); out << if_output(opts) { '' }; end
end
class Text < BaseEle
def initialize(text)
@content = text
end
alterable :content
def pathname; "text()" end
def to_s
Hpricot.uxs(@content)
end
alias_method :inner_text, :to_s
alias_method :to_plain_text, :to_s
def output(out, opts = {})
out <<
if_output(opts) do
@content
end
end
end
class CData < Text
alias_method :to_s, :content
alias_method :to_plain_text, :content
def output(out, opts = {})
out <<
if_output(opts) do
"<![CDATA[#@content]]>"
end
end
end
class XMLDecl < BaseEle
def initialize(version, encoding, standalone)
@version, @encoding, @standalone = version, encoding, standalone
end
alterable :version, :encoding, :standalone
def pathname; "xmldecl()" end
def output(out, opts = {})
out <<
if_output(opts) do
"<?xml version=\"#{@version}\"" +
(@encoding ? " encoding=\"#{encoding}\"" : "") +
(@standalone != nil ? " standalone=\"#{standalone ? 'yes' : 'no'}\"" : "") +
"?>"
end
end
end
class DocType < BaseEle
def initialize(target, pubid, sysid)
@target, @public_id, @system_id = target, pubid, sysid
end
alterable :target, :public_id, :system_id
def pathname; "doctype()" end
def output(out, opts = {})
out <<
if_output(opts) do
"<!DOCTYPE #{@target} " +
(@public_id ? "PUBLIC \"#{@public_id}\"" : "SYSTEM") +
(@system_id ? " #{html_quote(@system_id)}" : "") + ">"
end
end
end
class ProcIns < BaseEle
def initialize(target, content)
@target, @content = target, content
end
def pathname; "procins()" end
alterable :target, :content
def output(out, opts = {})
out <<
if_output(opts) do
"<?#{@target}" +
(@content ? " #{@content}" : "") +
"?>"
end
end
end
class Comment < BaseEle
def initialize(content)
@content = content
end
def pathname; "comment()" end
alterable :content
def output(out, opts = {})
out <<
if_output(opts) do
"<!--#{@content}-->"
end
end
end
# :startdoc:
end
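
Note on `alterable` and `:preserve`: setters generated by `alterable` discard the cached `@raw_string`, so `output` with `:preserve => true` reuses the original source only for nodes that were never changed. A minimal sketch of that interaction follows, using a hypothetical `MiniNode` class that stands in for a comment node.

```ruby
# Hypothetical miniature of the BaseEle pattern; not Hpricot's own class.
class MiniNode
  attr_accessor :raw_string

  # Define readers plus setters that drop the cached source text.
  def self.alterable(*fields)
    attr_reader(*fields)
    fields.each do |f|
      define_method("#{f}=") do |v|
        @raw_string = nil
        instance_variable_set("@#{f}", v)
      end
    end
  end

  alterable :content

  def initialize(content, raw_string)
    @content, @raw_string = content, raw_string
  end

  # With :preserve, reuse the original text unless the node was altered.
  def output(out, opts = {})
    out << (opts[:preserve] && @raw_string ? @raw_string : "<!--#{@content}-->")
  end
end

c = MiniNode.new(" hi ", "<!--  hi  -->")
puts c.output(String.new, {:preserve => true})   # original spacing kept
c.content = "bye"
puts c.output(String.new, {:preserve => true})   # regenerated after the change
```

This is the mechanism behind `to_original_html`: untouched nodes round-trip byte for byte, and only edited nodes are serialized afresh.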

@ -0,0 +1,164 @@
module Hpricot
FORM_TAGS = [ :form, :input, :select, :textarea ]
SELF_CLOSING_TAGS = [ :base, :meta, :link, :hr, :br, :param, :img, :area, :input, :col ]
# Common sets of attributes.
AttrCore = [:id, :class, :style, :title]
AttrI18n = [:lang, 'xml:lang'.intern, :dir]
AttrEvents = [:onclick, :ondblclick, :onmousedown, :onmouseup, :onmouseover, :onmousemove,
:onmouseout, :onkeypress, :onkeydown, :onkeyup]
AttrFocus = [:accesskey, :tabindex, :onfocus, :onblur]
AttrHAlign = [:align, :char, :charoff]
AttrVAlign = [:valign]
Attrs = AttrCore + AttrI18n + AttrEvents
# All the tags and attributes from XHTML 1.0 Strict
class XHTMLStrict
class << self
attr_accessor :tags, :tagset, :forms, :self_closing, :doctype
end
@doctype = ["-//W3C//DTD XHTML 1.0 Strict//EN", "DTD/xhtml1-strict.dtd"]
@tagset = {
:html => AttrI18n + [:id, :xmlns],
:head => AttrI18n + [:id, :profile],
:title => AttrI18n + [:id],
:base => [:href, :id],
:meta => AttrI18n + [:id, :http, :name, :content, :scheme, 'http-equiv'.intern],
:link => Attrs + [:charset, :href, :hreflang, :type, :rel, :rev, :media],
:style => AttrI18n + [:id, :type, :media, :title, 'xml:space'.intern],
:script => [:id, :charset, :type, :src, :defer, 'xml:space'.intern],
:noscript => Attrs,
:body => Attrs + [:onload, :onunload],
:div => Attrs,
:p => Attrs,
:ul => Attrs,
:ol => Attrs,
:li => Attrs,
:dl => Attrs,
:dt => Attrs,
:dd => Attrs,
:address => Attrs,
:hr => Attrs,
:pre => Attrs + ['xml:space'.intern],
:blockquote => Attrs + [:cite],
:ins => Attrs + [:cite, :datetime],
:del => Attrs + [:cite, :datetime],
:a => Attrs + AttrFocus + [:charset, :type, :name, :href, :hreflang, :rel, :rev, :shape, :coords],
:span => Attrs,
:bdo => AttrCore + AttrEvents + [:lang, 'xml:lang'.intern, :dir],
:br => AttrCore,
:em => Attrs,
:strong => Attrs,
:dfn => Attrs,
:code => Attrs,
:samp => Attrs,
:kbd => Attrs,
:var => Attrs,
:cite => Attrs,
:abbr => Attrs,
:acronym => Attrs,
:q => Attrs + [:cite],
:sub => Attrs,
:sup => Attrs,
:tt => Attrs,
:i => Attrs,
:b => Attrs,
:big => Attrs,
:small => Attrs,
:object => Attrs + [:declare, :classid, :codebase, :data, :type, :codetype, :archive, :standby, :height, :width, :usemap, :name, :tabindex],
:param => [:id, :name, :value, :valuetype, :type],
:img => Attrs + [:src, :alt, :longdesc, :height, :width, :usemap, :ismap],
:map => AttrI18n + AttrEvents + [:id, :class, :style, :title, :name],
:area => Attrs + AttrFocus + [:shape, :coords, :href, :nohref, :alt],
:form => Attrs + [:action, :method, :enctype, :onsubmit, :onreset, :accept, 'accept-charset'.intern],
:label => Attrs + [:for, :accesskey, :onfocus, :onblur],
:input => Attrs + AttrFocus + [:type, :name, :value, :checked, :disabled, :readonly, :size, :maxlength, :src, :alt, :usemap, :onselect, :onchange, :accept],
:select => Attrs + [:name, :size, :multiple, :disabled, :tabindex, :onfocus, :onblur, :onchange],
:optgroup => Attrs + [:disabled, :label],
:option => Attrs + [:selected, :disabled, :label, :value],
:textarea => Attrs + AttrFocus + [:name, :rows, :cols, :disabled, :readonly, :onselect, :onchange],
:fieldset => Attrs,
:legend => Attrs + [:accesskey],
:button => Attrs + AttrFocus + [:name, :value, :type, :disabled],
:table => Attrs + [:summary, :width, :border, :frame, :rules, :cellspacing, :cellpadding],
:caption => Attrs,
:colgroup => Attrs + AttrHAlign + AttrVAlign + [:span, :width],
:col => Attrs + AttrHAlign + AttrVAlign + [:span, :width],
:thead => Attrs + AttrHAlign + AttrVAlign,
:tfoot => Attrs + AttrHAlign + AttrVAlign,
:tbody => Attrs + AttrHAlign + AttrVAlign,
:tr => Attrs + AttrHAlign + AttrVAlign,
:th => Attrs + AttrHAlign + AttrVAlign + [:abbr, :axis, :headers, :scope, :rowspan, :colspan],
:td => Attrs + AttrHAlign + AttrVAlign + [:abbr, :axis, :headers, :scope, :rowspan, :colspan],
:h1 => Attrs,
:h2 => Attrs,
:h3 => Attrs,
:h4 => Attrs,
:h5 => Attrs,
:h6 => Attrs
}
@tags = @tagset.keys
@forms = @tags & FORM_TAGS
@self_closing = @tags & SELF_CLOSING_TAGS
end
# Additional tags found in XHTML 1.0 Transitional
class XHTMLTransitional
class << self
attr_accessor :tags, :tagset, :forms, :self_closing, :doctype
end
@doctype = ["-//W3C//DTD XHTML 1.0 Transitional//EN", "DTD/xhtml1-transitional.dtd"]
@tagset = XHTMLStrict.tagset.merge \
:strike => Attrs,
:center => Attrs,
:dir => Attrs + [:compact],
:noframes => Attrs,
:basefont => [:id, :size, :color, :face],
:u => Attrs,
:menu => Attrs + [:compact],
:iframe => AttrCore + [:longdesc, :name, :src, :frameborder, :marginwidth, :marginheight, :scrolling, :align, :height, :width],
:font => AttrCore + AttrI18n + [:size, :color, :face],
:s => Attrs,
:applet => AttrCore + [:codebase, :archive, :code, :object, :alt, :name, :width, :height, :align, :hspace, :vspace],
:isindex => AttrCore + AttrI18n + [:prompt]
# Additional attributes found in XHTML 1.0 Transitional
{ :script => [:language],
:a => [:target],
:td => [:bgcolor, :nowrap, :width, :height],
:p => [:align],
:h5 => [:align],
:h3 => [:align],
:li => [:type, :value],
:div => [:align],
:pre => [:width],
:body => [:background, :bgcolor, :text, :link, :vlink, :alink],
:ol => [:type, :compact, :start],
:h4 => [:align],
:h2 => [:align],
:object => [:align, :border, :hspace, :vspace],
:img => [:name, :align, :border, :hspace, :vspace],
:link => [:target],
:legend => [:align],
:dl => [:compact],
:input => [:align],
:h6 => [:align],
:hr => [:align, :noshade, :size, :width],
:base => [:target],
:ul => [:type, :compact],
:br => [:clear],
:form => [:name, :target],
:area => [:target],
:h1 => [:align]
}.each do |k, v|
@tagset[k] += v
end
@tags = @tagset.keys
@forms = @tags & FORM_TAGS
@self_closing = @tags & SELF_CLOSING_TAGS
end
end
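
Note on how XHTMLTransitional extends XHTMLStrict: the tagset is built with `merge` for new tags and `+=` for extra attributes on existing tags. Since `+=` rebinds the hash entry to a new array, the strict tagset is left untouched. A small sketch with made-up, shortened tables (the constant names here are assumptions):

```ruby
ATTRS  = [:id, :class].freeze
STRICT = { :p => ATTRS, :a => ATTRS + [:href] }.freeze

extra_tags  = { :center => ATTRS }
extra_attrs = { :p => [:align], :a => [:target] }

# merge returns a new hash; += then replaces entries in that copy only.
TRANSITIONAL = STRICT.merge(extra_tags)
extra_attrs.each { |tag, attrs| TRANSITIONAL[tag] += attrs }

p STRICT[:p]
p TRANSITIONAL[:p]
p TRANSITIONAL.keys - STRICT.keys
```

Had the code used `<<` or `concat` instead of `+=`, the arrays shared with the strict tagset would have been modified in place.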

@ -0,0 +1,821 @@
require 'hpricot/elements'
require 'uri'
module Hpricot
module Traverse
# Is this object the enclosing HTML or XML document?
def doc?() Doc::Trav === self end
# Is this object an HTML or XML element?
def elem?() Elem::Trav === self end
# Is this object an HTML text node?
def text?() Text::Trav === self end
# Is this object an XML declaration?
def xmldecl?() XMLDecl::Trav === self end
# Is this object a doctype tag?
def doctype?() DocType::Trav === self end
# Is this object an XML processing instruction?
def procins?() ProcIns::Trav === self end
# Is this object a comment?
def comment?() Comment::Trav === self end
# Is this object a stranded end tag?
def bogusetag?() BogusETag::Trav === self end
# Builds an HTML string from this node and its contents.
# If you need to write to a stream, try calling <tt>output(io)</tt>
# as a method on this object.
def to_html
output("")
end
alias_method :to_s, :to_html
# Attempts to preserve the original HTML of the document, only
# outputting new tags for elements which have changed.
def to_original_html
output("", :preserve => true)
end
def index(name)
i = 0
return i if name == "*"
children.each do |x|
return i if (x.respond_to?(:name) and name == x.name) or
(x.text? and name == "text()")
i += 1
end
-1
end
# Puts together an array of neighboring nodes based on their proximity
# to this node. So, for example, to get the next node, you could use
# <tt>nodes_at(1)</tt>. Or, to get the previous node, use <tt>nodes_at(-1)</tt>.
#
# This method also accepts ranges and sets of numbers.
#
# ele.nodes_at(-3..-1, 1..3) # gets three nodes before and three after
# ele.nodes_at(1, 5, 7) # gets three nodes at offsets below the current node
# ele.nodes_at(0, 5..6) # the current node and two others
def nodes_at(*pos)
sib = parent.children
i, si = 0, sib.index(self)
pos.map! do |r|
if r.is_a?(Range) and r.begin.is_a?(String)
r = Range.new(parent.index(r.begin)-si, parent.index(r.end)-si, r.exclude_end?)
end
r
end
Elements[*
sib.select do |x|
sel =
case i - si when *pos
true
end
i += 1
sel
end
]
end
# Returns the node neighboring this node to the south: just below it.
# This method includes text nodes and comments and such.
def next
sib = parent.children
sib[sib.index(self) + 1] if parent
end
alias_method :next_node, :next
# Returns the node neighboring this node to the north: just above it.
# This method includes text nodes and comments and such.
def previous
sib = parent.children
x = sib.index(self) - 1
sib[x] if sib and x >= 0
end
alias_method :previous_node, :previous
# Find all preceding nodes.
def preceding
sibs = parent.children
si = sibs.index(self)
return Elements[*sibs[0...si]]
end
# Find all nodes which follow the current one.
def following
sibs = parent.children
si = sibs.index(self) + 1
return Elements[*sibs[si...sibs.length]]
end
# Adds the elements contained in the +html+ string immediately after this element.
def after(html = nil, &blk)
parent.insert_after(Hpricot.make(html, &blk), self)
end
# Adds the elements contained in the +html+ string immediately before this element.
def before(html = nil, &blk)
parent.insert_before(Hpricot.make(html, &blk), self)
end
# Replace this element and its contents with the nodes contained
# in the +html+ string.
def swap(html = nil, &blk)
parent.altered!
parent.replace_child(self, Hpricot.make(html, &blk))
end
def get_subnode(*indexes)
n = self
indexes.each {|index|
n = n.get_subnode_internal(index)
}
n
end
# Builds a string from the text contained in this node. All
# HTML elements are removed.
def to_plain_text
if respond_to? :children
children.map { |x| x.to_plain_text }.join.strip.gsub(/\n{2,}/, "\n\n")
end
end
# Builds a string by concatenating the text nodes contained in this
# node, with no extra formatting. All HTML elements are removed.
def inner_text
if respond_to? :children
children.map { |x| x.inner_text }.join
end
end
alias_method :innerText, :inner_text
# Builds an HTML string from the contents of this node. If +inner+
# or a block is given, replaces the contents of this node instead.
def html(inner = nil, &blk)
if inner or blk
altered!
case inner
when Array
self.children = inner
else
self.children = Hpricot.make(inner, &blk)
end
reparent self.children
else
if respond_to? :children
children.map { |x| x.output("") }.join
end
end
end
alias_method :inner_html, :html
alias_method :innerHTML, :inner_html
# Inserts new contents into the current node, based on
# the HTML contained in string +inner+.
def inner_html=(inner)
html(inner || [])
end
alias_method :innerHTML=, :inner_html=
def reparent(nodes)
altered!
[*nodes].each { |e| e.parent = self }
end
private :reparent
def clean_path(path)
path.gsub(/^\s+|\s+$/, '')
end
# Builds a unique XPath string for this node, from the
# root of the document containing it.
def xpath
if elem? and has_attribute? 'id'
"//#{self.name}[@id='#{get_attribute('id')}']"
else
sim, id = 0, 0
parent.children.each do |e|
id = sim if e == self
sim += 1 if e.pathname == self.pathname
end
p = File.join(parent.xpath, self.pathname)
p += "[#{id+1}]" if sim >= 2
p
end
end
# Builds a unique CSS string for this node, from the
# root of the document containing it.
def css_path
if elem? and has_attribute? 'id'
"##{get_attribute('id')}"
else
sim, i, id = 0, 0, 0
parent.children.each do |e|
id = sim if e == self
sim += 1 if e.pathname == self.pathname
end
p = parent.css_path
p = p ? "#{p} > #{self.pathname}" : self.pathname
p += ":nth(#{id})" if sim >= 2
p
end
end
def node_position
parent.children.index(self)
end
def position
parent.children_of_type(self.pathname).index(self)
end
# Searches this node for all elements matching
# the CSS or XPath +expr+. Returns an Elements array
# containing the matching nodes. If +blk+ is given, it
# is used to iterate through the matching set.
def search(expr, &blk)
if Range === expr
return Elements.expand(at(expr.begin), at(expr.end), expr.exclude_end?)
end
last = nil
nodes = [self]
done = []
expr = expr.to_s
hist = []
until expr.empty?
expr = clean_path(expr)
expr.gsub!(%r!^//!, '')
case expr
when %r!^/?\.\.!
last = expr = $'
nodes.map! { |node| node.parent }
when %r!^[>/]\s*!
last = expr = $'
nodes = Elements[*nodes.map { |node| node.children if node.respond_to? :children }.flatten.compact]
when %r!^\+!
last = expr = $'
nodes.map! do |node|
siblings = node.parent.children
siblings[siblings.index(node)+1]
end
nodes.compact!
when %r!^~!
last = expr = $'
nodes.map! do |node|
siblings = node.parent.children
siblings[(siblings.index(node)+1)..-1]
end
nodes.flatten!
when %r!^[|,]!
last = expr = " #$'"
nodes.shift if nodes.first == self
done += nodes
nodes = [self]
else
m = expr.match(%r!^([#.]?)([a-z0-9\\*_-]*)!i).to_a
after = $'
mt = after[%r!:[a-z0-9\\*_-]+!i, 0]
oop = false
if mt and not (mt == ":not" or Traverse.method_defined? "filter[#{mt}]")
after = $'
m[2] += mt
expr = after
end
if m[1] == '#'
oid = get_element_by_id(m[2])
nodes = oid ? [oid] : []
expr = after
else
m[2] = "*" if after =~ /^\(\)/ || m[2] == "" || m[1] == "."
ret = []
nodes.each do |node|
case m[2]
when '*'
node.traverse_element { |n| ret << n }
else
if node.respond_to? :get_elements_by_tag_name
ret += [*node.get_elements_by_tag_name(m[2])] - [*(node unless last)]
end
end
end
nodes = ret
end
last = nil
end
hist << expr
break if hist[-1] == hist[-2]
nodes, expr = Elements.filter(nodes, expr)
end
nodes = done + nodes.flatten.uniq
if blk
nodes.each(&blk)
self
else
Elements[*nodes]
end
end
alias_method :/, :search
# Find the first matching node for the CSS or XPath
# +expr+ string.
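#
# For illustration, assuming +doc+ is an already parsed document:
#
#   doc.at("#link1")    # the first match, or nil if nothing matches
#   doc % "p.ohmy"      # +%+ is an alias for +at+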
def at(expr)
search(expr).first
end
alias_method :%, :at
# +traverse_element+ traverses elements in the tree.
# It yields elements in depth first order.
#
# If _names_ is empty, it yields all elements.
# If _names_ is non-empty, it should be a list of universal names.
#
# A nested element is yielded in depth first order as follows.
#
# t = Hpricot('<a id=0><b><a id=1 /></b><c id=2 /></a>')
# t.traverse_element("a", "c") {|e| p e}
# # =>
# {elem <a id="0"> {elem <b> {emptyelem <a id="1">} </b>} {emptyelem <c id="2">} </a>}
# {emptyelem <a id="1">}
# {emptyelem <c id="2">}
#
# Universal names are specified as follows.
#
# t = Hpricot(<<'End')
# <html>
# <meta name="robots" content="index,nofollow">
# <meta name="author" content="Who am I?">
# </html>
# End
# t.traverse_element("{http://www.w3.org/1999/xhtml}meta") {|e| p e}
# # =>
# {emptyelem <{http://www.w3.org/1999/xhtml}meta name="robots" content="index,nofollow">}
# {emptyelem <{http://www.w3.org/1999/xhtml}meta name="author" content="Who am I?">}
#
def traverse_element(*names, &block) # :yields: element
if names.empty?
traverse_all_element(&block)
else
name_set = {}
names.each {|n| name_set[n] = true }
traverse_some_element(name_set, &block)
end
nil
end
# Find children of a given +tag_name+.
#
# ele.children_of_type('p')
# #=> [...array of paragraphs...]
#
def children_of_type(tag_name)
if respond_to? :children
children.find_all do |x|
x.respond_to?(:pathname) && x.pathname == tag_name
end
end
end
end
module Container::Trav
# Return all children of this node which can contain other
# nodes. This is a good way to get all HTML elements which
# aren't text, comment, doctype or processing instruction nodes.
def containers
children.grep(Container::Trav)
end
# Returns the container node neighboring this node to the south: just below it.
# By "container" node, I mean: this method does not find text nodes or comments or cdata or any of that.
# See Hpricot::Traverse#next_node if you need to hunt out all kinds of nodes.
def next_sibling
return unless parent
sib = parent.containers
sib[sib.index(self) + 1]
end
# Returns the container node neighboring this node to the north: just above it.
# By "container" node, I mean: this method does not find text nodes or comments or cdata or any of that.
# See Hpricot::Traverse#previous_node if you need to hunt out all kinds of nodes.
def previous_sibling
sib = parent.containers
x = sib.index(self) - 1
sib[x] if sib and x >= 0
end
# Find all preceding sibling elements. Like the other "sibling" methods, this weeds
# out text and comment nodes.
def preceding_siblings()
sibs = parent.containers
si = sibs.index(self)
return Elements[*sibs[0...si]]
end
# Find sibling elements which follow the current one. Like the other "sibling" methods, this weeds
# out text and comment nodes.
def following_siblings()
sibs = parent.containers
si = sibs.index(self) + 1
return Elements[*sibs[si...sibs.length]]
end
# Puts together an array of neighboring sibling elements based on their proximity
# to this element.
#
# This method accepts ranges and sets of numbers.
#
# ele.siblings_at(-3..-1, 1..3) # gets three elements before and three after
# ele.siblings_at(1, 5, 7) # gets three elements at offsets below the current element
# ele.siblings_at(0, 5..6) # the current element and two others
#
# Like the other "sibling" methods, this doesn't find text and comment nodes.
# Use nodes_at to include those nodes.
def siblings_at(*pos)
sib = parent.containers
i, si = 0, sib.index(self)
Elements[*
sib.select do |x|
sel = case i - si when *pos
true
end
i += 1
sel
end
]
end
# Replace +old+, a child of the current node, with +new+ node.
def replace_child(old, new)
reparent new
children[children.index(old), 1] = [*new]
end
# Insert +nodes+, an array of HTML elements or a single element,
# before the node +ele+, a child of the current node.
def insert_before(nodes, ele)
case nodes
when Array
nodes.each { |n| insert_before(n, ele) }
else
reparent nodes
children[children.index(ele) || 0, 0] = nodes
end
end
# Insert +nodes+, an array of HTML elements or a single element,
# after the node +ele+, a child of the current node.
def insert_after(nodes, ele)
case nodes
when Array
nodes.reverse_each { |n| insert_after(n, ele) }
else
reparent nodes
idx = children.index(ele)
children[idx ? idx + 1 : children.length, 0] = nodes
end
end
# +each_child+ iterates over each child.
def each_child(&block) # :yields: child_node
children.each(&block)
nil
end
# +each_child_with_index+ iterates over each child, yielding the child and its index.
def each_child_with_index(&block) # :yields: child_node, index
children.each_with_index(&block)
nil
end
# +find_element+ searches for an element whose universal name is given by
# the arguments.
# It returns the first match, or nil if none is found.
def find_element(*names)
traverse_element(*names) {|e| return e }
nil
end
# Returns a list of CSS classes to which this element belongs.
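#
# For example, assuming +para+ is the element parsed from
# <p class="last final">:
#
#   para.classes   # ["last", "final"]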
def classes
get_attribute('class').to_s.strip.split(/\s+/)
end
def get_element_by_id(id)
traverse_all_element do |ele|
if ele.elem? and eid = ele.get_attribute('id')
return ele if eid.to_s == id
end
end
nil
end
def get_elements_by_tag_name(*a)
list = Elements[]
traverse_element(*a.map { |tag| [tag, "{http://www.w3.org/1999/xhtml}#{tag}"] }.flatten) do |e|
list << e
end
list
end
def each_hyperlink_attribute
traverse_element(
'{http://www.w3.org/1999/xhtml}a',
'{http://www.w3.org/1999/xhtml}area',
'{http://www.w3.org/1999/xhtml}link',
'{http://www.w3.org/1999/xhtml}img',
'{http://www.w3.org/1999/xhtml}object',
'{http://www.w3.org/1999/xhtml}q',
'{http://www.w3.org/1999/xhtml}blockquote',
'{http://www.w3.org/1999/xhtml}ins',
'{http://www.w3.org/1999/xhtml}del',
'{http://www.w3.org/1999/xhtml}form',
'{http://www.w3.org/1999/xhtml}input',
'{http://www.w3.org/1999/xhtml}head',
'{http://www.w3.org/1999/xhtml}base',
'{http://www.w3.org/1999/xhtml}script') {|elem|
case elem.name
when %r{\{http://www.w3.org/1999/xhtml\}(?:base|a|area|link)\z}i
attrs = ['href']
when %r{\{http://www.w3.org/1999/xhtml\}(?:img)\z}i
attrs = ['src', 'longdesc', 'usemap']
when %r{\{http://www.w3.org/1999/xhtml\}(?:object)\z}i
attrs = ['classid', 'codebase', 'data', 'usemap']
when %r{\{http://www.w3.org/1999/xhtml\}(?:q|blockquote|ins|del)\z}i
attrs = ['cite']
when %r{\{http://www.w3.org/1999/xhtml\}(?:form)\z}i
attrs = ['action']
when %r{\{http://www.w3.org/1999/xhtml\}(?:input)\z}i
attrs = ['src', 'usemap']
when %r{\{http://www.w3.org/1999/xhtml\}(?:head)\z}i
attrs = ['profile']
when %r{\{http://www.w3.org/1999/xhtml\}(?:script)\z}i
attrs = ['src', 'for']
end
attrs.each {|attr|
if hyperlink = elem.get_attribute(attr)
yield elem, attr, hyperlink
end
}
}
end
private :each_hyperlink_attribute
# +each_hyperlink_uri+ traverses hyperlinks such as the HTML href attribute
# of the A element.
#
# It yields an Hpricot::Text and a URI for each hyperlink.
#
# The URI objects are created with a base URI which is given by
# the HTML BASE element or the argument +base_uri+.
# +each_hyperlink_uri+ doesn't yield the href of the BASE element.
def each_hyperlink_uri(base_uri=nil) # :yields: hyperlink, uri
base_uri = URI.parse(base_uri) if String === base_uri
links = []
each_hyperlink_attribute {|elem, attr, hyperlink|
if %r{\{http://www.w3.org/1999/xhtml\}(?:base)\z}i =~ elem.name
base_uri = URI.parse(hyperlink.to_s)
else
links << hyperlink
end
}
if base_uri
links.each {|hyperlink| yield hyperlink, base_uri + hyperlink.to_s }
else
links.each {|hyperlink| yield hyperlink, URI.parse(hyperlink.to_s) }
end
end
# +each_hyperlink+ traverses hyperlinks such as the HTML href attribute
# of the A element.
#
# It yields an Hpricot::Text for each hyperlink.
#
# Note that +each_hyperlink+ yields the href attribute of the BASE element.
def each_hyperlink # :yields: text
each_hyperlink_attribute {|elem, attr, hyperlink|
yield hyperlink
}
end
# +each_uri+ traverses hyperlinks such as the HTML href attribute
# of the A element.
#
# It yields a URI for each hyperlink.
#
# The URI objects are created with a base URI which is given by
# the HTML BASE element or the argument +base_uri+.
def each_uri(base_uri=nil) # :yields: URI
each_hyperlink_uri(base_uri) {|hyperlink, uri| yield uri }
end
end
# :stopdoc:
module Doc::Trav
def traverse_all_element(&block)
children.each {|c| c.traverse_all_element(&block) }
end
def xpath
"/"
end
def css_path
nil
end
end
module Elem::Trav
def traverse_all_element(&block)
yield self
children.each {|c| c.traverse_all_element(&block) }
end
end
module Leaf::Trav
def traverse_all_element
yield self
end
end
module Doc::Trav
def traverse_some_element(name_set, &block)
children.each {|c| c.traverse_some_element(name_set, &block) }
end
end
module Elem::Trav
def traverse_some_element(name_set, &block)
yield self if name_set.include? self.name
children.each {|c| c.traverse_some_element(name_set, &block) }
end
end
module Leaf::Trav
def traverse_some_element(name_set)
end
end
# :startdoc:
module Traverse
# +traverse_text+ traverses the text nodes in the tree.
def traverse_text(&block) # :yields: text
traverse_text_internal(&block)
nil
end
end
# :stopdoc:
module Container::Trav
def traverse_text_internal(&block)
each_child {|c| c.traverse_text_internal(&block) }
end
end
module Leaf::Trav
def traverse_text_internal
end
end
module Text::Trav
def traverse_text_internal
yield self
end
end
# :startdoc:
module Container::Trav
# +filter+ rebuilds the tree without some components.
#
# node.filter {|descendant_node| predicate } -> node
# loc.filter {|descendant_loc| predicate } -> node
#
# +filter+ yields each node except the top node.
# If the given block returns false, the corresponding node is dropped.
# If the given block returns true, the corresponding node is retained and
# its inner nodes are examined.
#
# +filter+ returns a node.
# It doesn't return a location object even if self is a location object.
#
def filter(&block)
subst = {}
each_child_with_index {|descendant, i|
if yield descendant
if descendant.elem?
subst[i] = descendant.filter(&block)
else
subst[i] = descendant
end
else
subst[i] = nil
end
}
to_node.subst_subnode(subst)
end
end
module Doc::Trav
# +title+ searches for the title and returns it as text.
# It returns nil if not found.
#
# +title+ searches the following information.
#
# - <title>...</title> in HTML
# - <title>...</title> in RSS
def title
e = find_element('title',
'{http://www.w3.org/1999/xhtml}title',
'{http://purl.org/rss/1.0/}title',
'{http://my.netscape.com/rdf/simple/0.9/}title')
e && e.extract_text
end
# +author+ searches for the author and returns it as text.
# It returns nil if not found.
#
# +author+ searches the following information.
#
# - <meta name="author" content="author-name"> in HTML
# - <link rev="made" title="author-name"> in HTML
# - <dc:creator>author-name</dc:creator> in RSS
# - <dc:publisher>author-name</dc:publisher> in RSS
def author
traverse_element('meta',
'{http://www.w3.org/1999/xhtml}meta') {|e|
begin
next unless e.fetch_attr('name').downcase == 'author'
author = e.fetch_attribute('content').strip
return author if !author.empty?
rescue IndexError
end
}
traverse_element('link',
'{http://www.w3.org/1999/xhtml}link') {|e|
begin
next unless e.fetch_attr('rev').downcase == 'made'
author = e.fetch_attribute('title').strip
return author if !author.empty?
rescue IndexError
end
}
if channel = find_element('{http://purl.org/rss/1.0/}channel')
channel.traverse_element('{http://purl.org/dc/elements/1.1/}creator') {|e|
begin
author = e.extract_text.strip
return author if !author.empty?
rescue IndexError
end
}
channel.traverse_element('{http://purl.org/dc/elements/1.1/}publisher') {|e|
begin
author = e.extract_text.strip
return author if !author.empty?
rescue IndexError
end
}
end
nil
end
end
module Doc::Trav
def root
es = []
children.each {|c| es << c if c.elem? }
raise Hpricot::Error, "no element" if es.empty?
raise Hpricot::Error, "multiple top elements" if 1 < es.length
es[0]
end
end
module Elem::Trav
def has_attribute?(name)
self.raw_attributes && self.raw_attributes.has_key?(name.to_s)
end
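# Returns the value of the attribute +name+ with XML entities
# unescaped, or nil if the attribute is not set.
def get_attribute(name)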
a = self.raw_attributes && self.raw_attributes[name.to_s]
a = Hpricot.uxs(a) if a
a
end
alias_method :[], :get_attribute
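# Sets the attribute +name+ to +val+. The value is XML-escaped
# before it is stored in raw_attributes.
def set_attribute(name, val)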
altered!
self.raw_attributes ||= {}
self.raw_attributes[name.to_s] = Hpricot.xs(val)
end
alias_method :[]=, :set_attribute
def remove_attribute(name)
name = name.to_s
if has_attribute? name
altered!
self.raw_attributes.delete(name)
end
end
end
end

@ -0,0 +1,94 @@
#!/usr/bin/env ruby
# The XChar library is provided courtesy of Sam Ruby (See
# http://intertwingly.net/stories/2005/09/28/xchar.rb)
# --------------------------------------------------------------------
######################################################################
module Hpricot
####################################################################
# XML Character converter, from Sam Ruby:
# (see http://intertwingly.net/stories/2005/09/28/xchar.rb).
#
module XChar # :nodoc:
# See
# http://intertwingly.net/stories/2004/04/14/i18n.html#CleaningWindows
# for details.
CP1252 = { # :nodoc:
128 => 8364, # euro sign
130 => 8218, # single low-9 quotation mark
131 => 402, # latin small letter f with hook
132 => 8222, # double low-9 quotation mark
133 => 8230, # horizontal ellipsis
134 => 8224, # dagger
135 => 8225, # double dagger
136 => 710, # modifier letter circumflex accent
137 => 8240, # per mille sign
138 => 352, # latin capital letter s with caron
139 => 8249, # single left-pointing angle quotation mark
140 => 338, # latin capital ligature oe
142 => 381, # latin capital letter z with caron
145 => 8216, # left single quotation mark
146 => 8217, # right single quotation mark
147 => 8220, # left double quotation mark
148 => 8221, # right double quotation mark
149 => 8226, # bullet
150 => 8211, # en dash
151 => 8212, # em dash
152 => 732, # small tilde
153 => 8482, # trade mark sign
154 => 353, # latin small letter s with caron
155 => 8250, # single right-pointing angle quotation mark
156 => 339, # latin small ligature oe
158 => 382, # latin small letter z with caron
159 => 376, # latin capital letter y with diaeresis
}
# See http://www.w3.org/TR/REC-xml/#dt-chardata for details.
PREDEFINED = {
34 => '&quot;', # quotation mark
38 => '&amp;', # ampersand
60 => '&lt;', # left angle bracket
62 => '&gt;' # right angle bracket
}
PREDEFINED_U = PREDEFINED.inject({}) { |hsh, (k, v)| hsh[v] = k; hsh }
# See http://www.w3.org/TR/REC-xml/#charsets for details.
VALID = [
0x9, 0xA, 0xD,
(0x20..0xD7FF),
(0xE000..0xFFFD),
(0x10000..0x10FFFF)
]
end
class << self
# XML escaped version of chr
def xchr(str)
n = XChar::CP1252[str] || str
case n when *XChar::VALID
XChar::PREDEFINED[n] or (n<128 ? n.chr : "&##{n};")
else
'*'
end
end
# XML escaped version of to_s
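#
# A small example of what the PREDEFINED table above implies:
#
#   Hpricot.xs("<a & b>")   # "&lt;a &amp; b&gt;"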
def xs(str)
str.to_s.unpack('U*').map {|n| xchr(n)}.join # ASCII, UTF-8
rescue
str.to_s.unpack('C*').map {|n| xchr(n)}.join # ISO-8859-1, WIN-1252
end
# XML unescape
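#
# A few examples, read off the two substitutions below:
#
#   Hpricot.uxs("&lt;a &amp; b&gt;")   # "<a & b>"
#   Hpricot.uxs("&#8220;")             # the character U+201C
#   Hpricot.uxs("&nbsp;")              # "?"
#
# Only the four PREDEFINED entities are known by name; any other
# named entity is replaced with a question mark.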
def uxs(str)
str.to_s.
gsub(/\&\w+;/) { |x| (XChar::PREDEFINED_U[x] || ??).chr }.
gsub(/\&\#(\d+);/) { [$1.to_i].pack("U*") }
end
end
end

@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Sample XHTML</title>
<link rel='stylesheet' href='test1.css' />
<link rel='stylesheet' href='test2.css' />
<link rel='stylesheet' href='test3.css' />
</head>
<body id='body1'>
<p>Sample XHTML for <a id="link1" href="http://code.whytheluckystiff.net/mouseHole/">MouseHole 2</a>.</p>
<p class='ohmy'>Please filter <a id="link2" href="http://hobix.com/">me</a>!</p>
<p>The third paragraph</p>
<p class="last final"><b>THE FINAL PARAGRAPH</b></p>
</body>
</html>

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

@ -0,0 +1,16 @@
<html>
<HEAD>
<meta http-equiv="Refresh" content="0; url=http://tenderlovemaking.com">
<META http-equiv="Refresh" content="0; url=http://tenderlovemaking.com">
</head>
<body>
<a href ="http://tenderlovemaking.com/">My Site!</a>
<A href ="http://whytheluckystiff.net/">Your Site!</A>
<MAP>
<area HREF="http://whytheluckystiff.net/" COORDS="1,2,3,4"></area>
<AREA HREF="http://tenderlovemaking.com/" COORDS="1,2,3,4">
</area>
<AREA HREF="http://tenderlovemaking.com/" COORDS="5,5,10,10" />
</MAP>
</body>
</html>

@ -0,0 +1,220 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Free Genealogy and Family History Online - The USGenWeb Project</title>
<meta name="keywords" content="free genealogy search" />
<meta name="description" content="Free genealogy and family history online made possible by the USGenWeb Project volunteers. Search free genealogy websites for your ancestors." />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<link rel="stylesheet" type="text/css" href="usgw-layout.css" />
<link rel="stylesheet" type="text/css" href="usgw.css" />
<style type="text/css">
<!--
.pullquote {
font-family: Verdana, Arial, Helvetica, sans-serif;
font-size: 12px;
float: right;
width: 185px;
margin-top: 10px;
margin-bottom: 2px;
border-top-width: 10px;
border-bottom-width: 3px;
border-top-style: solid;
border-bottom-style: solid;
border-top-color: #38386E;
border-right-color: #38386E;
border-bottom-color: #38386E;
border-left-color: #38386E;
font-style: italic;
font-weight: normal;
border-right-width: 1px;
border-left-width: 1px;
border-right-style: solid;
border-left-style: solid;
padding: 3px;
}
.style2 {color: #003366}
-->
</style>
</head>
<body>
<!-- HEADER DIV -->
<div id="hdr">
<div align="center"><img alt="The USGenWeb Project, Free Genealogy Online" src="images/widelogo.jpg" width="740" height="150" /></div>
</div>
<!-- HEADER LINKS -->
<div id="hdr2">
<div align="center"><img src="images/navbar.gif" width="740" height="30" usemap="#Map" border="0" />
<map name="Map">
<area shape="rect" coords="46,1,126,28" href="index.shtml" alt="Home">
<area shape="rect" coords="134,1,223,28" href="about/index.shtml" alt="About Us">
<area shape="rect" coords="239,1,320,30" href="states/index.shtml" alt="States">
<area shape="rect" coords="332,1,424,28" href="projects/index.shtml" alt="Projects">
<area shape="rect" coords="444,2,555,28" href="research/index.shtml" alt="Researchers">
<area shape="rect" coords="575,0,686,28" href="volunteers/index.shtml" alt="Volunteers">
</map>
</div>
</div>
<!-- CENTER COLUMN -->
<div id="c-block">
<div id="c-col">
<p>&nbsp;</p>
<h3 align="center">Keeping Internet Genealogy Free<br />
<br />
</h3>
<div align="left">
<div>
<table>
<tr>
<td><div class="pullquote">
<p align="center"><span class="style2"><a href="states/counties.shtml">Counties of the Month</a></span><br />
<a href="http://www.rootsweb.com/~inmontgo/">Montgomery County, IN</a><br />
<a href="http://www.rootsweb.com/~flalachu/">Alachua County, FL</a><br />
<br />
<span class="style2"><a href="volunteers/FGS.shtml">Upcoming Events</a></span><br />
FGS Conference 2006<br />
<br />
</p>
</div>
<p><img src="photos/Gena-Farnham-Wallace.jpg" width="150" height="205" align="left" />
<p>Welcome to The USGenWeb Project! We are a group of volunteers working together to provide free genealogy websites for genealogical research in every county and every state of the United States. This Project is non-commercial and fully committed to free genealogy access for everyone.</p>
<p>Organization is by county and state, and this website provides you with links to all the state genealogy websites which, in turn, provide gateways to the counties. The USGenWeb Project also sponsors important Special Projects at the national level and this website provides an entry point to all of those pages, as well.</p>
<p>Clicking on a State Link (on the left) will take you to the State's website. Clicking on the tabs above will take you to additional information and links. </p>
<p>All of the volunteers who make up The USGenWeb Project are very proud of this endeavor and hope that you will find their hard work both beneficial and rewarding. Thank you for visiting!</p>
<p>The USGenWeb Project Team
</p>
<h3 align="center">10th Anniversary<br /> <br />
</h3>
<div align="left">
<p><img src="photos/oldphoto1.jpg" width="175" height="200" align="right" />2006 marks the 10th Anniversary of the USGenWeb Project and I have been looking back over those past 10 years. When the USGenWeb Project began, it was one of the few (if not the only) centralized places on the internet to find genealogy information and post a query. Those early state and county sites began with links to the small amount of on-line information of interest to a family historian and a query page. The only Special Project was the Archives. How far the Project has come during the past 10 years! Now there are several special projects and the states, counties and special projects sites of the Project not only contain links; they are filled with information and transcribed records, and more is being added every day by our wonderful, dedicated and hard working volunteers.</p>
<p>Ten years ago the internet, as we know it today, was in its infancy. The things we take for granted today--e-mail, PCs, cell phones, digital cameras, etc., were not in the average person's world. Family historians and professional genealogists not only didn't use the internet, most had never heard of it.</p>
<p>Over the past 10 years the internet has gone from obscurity to commonplace. As the internet became an every day tool for millions of people. it changed the way family historians do research. The availability of on-line, easily accessible genealogy and historical information has fueled the phenomenal growth of Genealogy as a hobby and, I'm proud to say, the Project has been right there every step of the way. </p>
<p>Everywhere we look we see genealogy reported as the fastest growing hobby in the country. Now the internet is the first stop for beginning family historians and is used extensively by experienced researchers. New &quot;How To&quot; genealogy books devote chapters to using the internet, and it is a rare book that does not recommend The USGenWeb Project as one of the first places to visit.</p>
<p>While subscription sites have popped up everywhere on the web, The Project has continued to offer free access to its vast wealth of information. The USGenWeb Project is recognized as the premier site of free information, and the Project's websites welcome well over a million visitors each day.</p>
<p>The Project is where it is today because of the thousands of volunteers, both past and present, who cared enough to devote, collectively, millions of hours to gathering, transcribing and uploading information. </p>
<p>To each and every volunteer, past and present, a heartfelt Thank You, because you are ones who have made The Project the fabulous resource it is today.</p>
<p>Linda Haas Davenport<br />
National Coordinator<br />
The USGenWeb Project</p>
<p></p></td>
</tr>
</table>
</div>
</p>
</div>
<br />
</div>
<!-- END CENTER COLUMN --></div>
<!-- END C-BLOCK -->
<div id="ftr">
<div align="center">
<div align="center"><img src="images/footer-bar.gif" width="740" height="30" usemap="#footerMap" border="0" /></div>
<map name="footerMap">
<area shape="rect" coords="430,6,565,25" href="http://www.usgenweb.org">
</map>
</div>
</div>
<!-- LEFT COLUMN -->
<div id="lh-col">
<span style="margin:10px 10px 10px 10px;"><br />
<a href="http://www.rootsweb.com/~algenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Alabama Genealogy">Alabama</a><br />
<a href="http://www.akgenweb.org" rel="nofollow" class="sidenavLnk" target=_blank" title="Alaska Genealogy">Alaska</a><br />
<a href="http://www.rootsweb.com/~azgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Arizona Genealogy">Arizona</a><br />
<a href="http://www.rootsweb.com/~argenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Arkansas Genealogy">Arkansas</a><br />
<a href="http://cagenweb.com/" rel="nofollow" class="sidenavLnk" target=_blank" title="California Genealogy">California</a><br />
<a href="http://www.rootsweb.com/~cogenweb/comain.htm" rel="nofollow" class="sidenavLnk" target=_blank" title="Colorado Genealogy">Colorado</a><br />
<a href="http://www.rootsweb.com/~ctgenweb" rel="nofollow" class="sidenavLnk" target=_blank" title="Connecticut Genealogy">Connecticut</a><br />
<a href="http://www.degenweb.org/" rel="nofollow" class="sidenavLnk" target=_blank" title="Delaware Genealogy">Delaware</a><br />
<a href="http://www.rootsweb.com/~dcgenweb/dc_genweb.htm" rel="nofollow" class="sidenavLnk" target=_blank" title="District of Columbia Genealogy">District of Columbia</a><br />
<a href="http://www.rootsweb.com/~flgenweb/index.html" rel="nofollow" class="sidenavLnk" target=_blank" title="Florida Genealogy">Florida</a><br />
<a href="http://www.rootsweb.com/~gagenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Georgia Genealogy">Georgia</a><br />
<a href="http://www.rootsweb.com/~higenweb/hawaii.htm" rel="nofollow" class="sidenavLnk" target=_blank" title="Hawaii Genealogy">Hawaii</a><br />
<a href="http://www.rootsweb.com/~idgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Idaho Genealogy">Idaho</a><br />
<a href="http://ilgenweb.rootsweb.com/" rel="nofollow" class="sidenavLnk" target=_blank" title="Illinois Genealogy">Illinois</a><br />
<a href="http://www.ingenweb.org" rel="nofollow" class="sidenavLnk" target=_blank" title="Indiana Genealogy">Indiana</a><br />
<a href="http://IAGenWeb.org" rel="nofollow" class="sidenavLnk" target=_blank" title="Iowa Genealogy">Iowa</a><br />
<a href="http://skyways.lib.ks.us/genweb/index.html" rel="nofollow" class="sidenavLnk" target=_blank" title="Kansas Genealogy">Kansas</a><br />
<a href="http://www.kygenweb.net/index.html" rel="nofollow" class="sidenavLnk" target=_blank" title="Kentucky Genealogy">Kentucky</a><br />
<a href="http://www.lagenweb.org/" rel="nofollow" class="sidenavLnk" target=_blank" title="Louisiana Genealogy">Louisiana</a><br />
<a href="http://www.rootsweb.com/~megenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Maine Genealogy">Maine</a><br />
<a href="http://www.mdgenweb.org" rel="nofollow" class="sidenavLnk" target=_blank" title="Maryland Genealogy">Maryland</a><br />
<a href="http://www.rootsweb.com/~magenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Massachusetts Genealogy">Massachusetts</a><br />
<a href="http://www.rootsweb.com/~migenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Michigan Genealogy">Michigan</a><br />
<a href="http://www.rootsweb.com/~mngenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Minnesota Genealogy">Minnesota</a><br />
<a href="http://www.rootsweb.com/~msgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Mississippi Genealogy">Mississippi</a><br />
<a href="http://www.rootsweb.com/~mogenweb/mo.htm" rel="nofollow" class="sidenavLnk" target=_blank" title="Missouri Genealogy">Missouri</a><br />
<a href="http://rootsweb.com/~mtgenweb" rel="nofollow" class="sidenavLnk" target=_blank" title="Montana Genealogy">Montana</a><br />
<a href="http://www.rootsweb.com/~negenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Nebraska Genealogy">Nebraska</a><br />
<a href="http://www.rootsweb.com/~nvgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Nevada Genealogy">Nevada</a><br />
<a href="http://www.usroots.com/~usgwnhus/" rel="nofollow" class="sidenavLnk" target=_blank" title="New Hampshire Genealogy">New Hampshire</a><br />
<a href="http://www.rootsweb.com/~njgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="New Jersey Genealogy">New Jersey</a><br />
<a href="http://www.rootsweb.com/~nmgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="New Mexico Genealogy">New Mexico</a><br />
<a href="http://www.rootsweb.com/~nygenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="New York Genealogy">New York</a><br />
<a href="http://www.rootsweb.com/~ncgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="North Carolina Genealogy">North Carolina</a><br />
<a href="http://www.rootsweb.com/~ndgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="North Dakota Genealogy">North Dakota</a><br />
<a href="http://www.rootsweb.com/~ohgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Ohio Genealogy">Ohio</a><br />
<a href="http://www.rootsweb.com/~okgenweb/index.htm" rel="nofollow" class="sidenavLnk" target=_blank" title="Oklahoma Genealogy">Oklahoma</a><br />
<a href="http://www.rootsweb.com/~itgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Oklahoma-Indian Territory Genealogy">Oklahoma/Indian Territory</a><br />
<a href="http://www.rootsweb.com/~orgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Oregon Genealogy">Oregon</a><br />
<a href="http://www.pagenweb.org/" rel="nofollow" class="sidenavLnk" target=_blank" title="Pennsylvania Genealogy">Pennsylvania</a><br />
<a href="http://www.rootsweb.com/~rigenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Rhode Island Genealogy">Rhode Island</a><br />
<a href="http://sciway3.net/scgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="South Carolina Genealogy">South Carolina</a><br />
<a href="http://www.rootsweb.com/~sdgenweb" rel="nofollow" class="sidenavLnk" target=_blank" title="South Dakota Genealogy">South Dakota</a><br />
<a href="http://www.tngenweb.org/" rel="nofollow" class="sidenavLnk" target=_blank" title="Tennessee Genealogy">Tennessee</a><br />
<a href="http://www.rootsweb.com/~txgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Texas Genealogy">Texas</a><br />
<a href="http://www.rootsweb.com/~utgenweb/index.html" rel="nofollow" class="sidenavLnk" target=_blank" title="Utah Genealogy">Utah</a><br />
<a href="http://home.att.net/~Local_History/VT_History.htm" rel="nofollow" class="sidenavLnk" target=_blank" title="Vermont Genealogy">Vermont</a><br />
<a href="http://www.rootsweb.com/~vagenweb" rel="nofollow" class="sidenavLnk" target=_blank" title="Virginia Genealogy">Virginia</a><br />
<a href="http://www.rootsweb.com/~wagenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Washington Genealogy">Washington</a><br />
<a href="http://www.rootsweb.com/~wvgenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="West Virginia Genealogy">West Virginia</a><br />
<a href="http://www.rootsweb.com/~wigenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Wisconsin Genealogy">Wisconsin</a><br />
<a href="http://www.rootsweb.com/~wygenweb/" rel="nofollow" class="sidenavLnk" target=_blank" title="Wyoming Genealogy">Wyoming</a>
</span>
</div>
<!-- END LEFT COLUMN -->
<!-- RIGHT COLUMN -->
<div id="rh-col">
<br />
<span style="margin: 10px 0px 6px 6px;">
<div align="center">
<p><img alt="The USGenWeb Project, Free Genealogy Online" src="images/usgenweb100x104.gif" width="100" height="104" /></p></div></span>
<span style="margin: 10px 0px 6px 6px;">
<div align="left">
<!-- <h4>Search Engines</h4> -->
<p><a href="../states/counties.shtml" rel="nofollow" class="sidenavLnk">County Spotlight</a><br />
<p><a href="http://www.rootsweb.com/~usgenweb/newsearch.htm" rel="nofollow" class="sidenavLnk" target="_blank">&nbsp;Project Archives</a><br />
</div>
<div align="center">
<hr width="75%" size="1" noshade />
</div>
<div align="left">
<p align="left" class="sidenav">Comments and administrative-type problems should be emailed to the <a href="mailto:lhaasdav@cox.net" class="link">National Coordinator</a>.
For complaints regarding a specific web site within the USGenWeb Project, please include the URL when emailing the National Coordinator.</p>
<p align="left" class="sidenav">Direct comments or suggestions about this web site to the <a href="mailto:webmaster@usgenweb.com" class="link">Webmaster</a>. </p>
<br />
<p align="center"><a href="http://www.rootsweb.com" rel="nofollow"><img src="images/rootsweb-blue-68x85.gif" width="68" height="85" border="0" alt="Visit Rootsweb"></a></p>
</div>
<p>
<a href="index.shtml" class="sidenavLnk" title="The USGenWeb Project">Home</a><br />
<a href="about/index.shtml" class="sidenavLnk" title="About The USGenWeb Project">About Us</a><br />
<a href="projects/index.shtml" class="sidenavLnk" title="Genealogy Projects">Projects</a><br />
<a href="research/index.shtml" class="sidenavLnk" title="Help for Genealogy Research">for Researchers</a><br />
<a href="volunteers/index.shtml" class="sidenavLnk" title="USGenWeb Volunteers">for Volunteers</a><br />
<a href="sitemap.shtml" class="sidenavLnk">Site Map</a></p>
</span>
</div>
<!-- END RIGHT COLUMN -->
</body>
</html>


@ -0,0 +1,19 @@
<?xml version='1.0'?><rss xmlns:admin='http://webns.net/mvcb/' version='2.0' xmlns:sy='http://purl.org/rss/1.0/modules/syndication/' xmlns:dc='http://purl.org/dc/elements/1.1/' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<channel>
<title>why the lucky stiff</title>
<link>http://whytheluckystiff.net</link>
<description>hex-editing reality to give us infinite grenades!!</description>
<dc:language>en-us</dc:language>
<dc:creator/>
<dc:date>2007-01-16T22:39:04+00:00</dc:date>
<admin:generatorAgent rdf:resource='http://hobix.com/?v=0.4'/>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
<item><title>1.3</title><link>http://whytheluckystiff.net/quatrains/1.3.html</link><guid isPermaLink='false'>quatrains/1.3@http://whytheluckystiff.net</guid><dc:subject>quatrains</dc:subject><dc:subject>quatrains</dc:subject><dc:creator>why the lucky stiff</dc:creator><dc:date>2007-01-14T08:47:05+00:00</dc:date><description>&lt;blockquote&gt;
&lt;p&gt;That cadillac of yours and that driver of yours!&lt;br /&gt;You and your teacups rattling away in the back seat!&lt;br /&gt;You always took the mike, oh, and all those cowboys you shot!&lt;br /&gt;I held your hand! And I&amp;#8217;ll shoot a cowboy one day!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;You said, &amp;#8220;Let&amp;#8217;s run into the woods like kids!&amp;#8221; &lt;br /&gt;You said, &amp;#8220;Let&amp;#8217;s rub our hands together super-hot!&amp;#8221; &lt;br /&gt;And we scalded the trees and left octagons, I think that was you and&lt;br /&gt;You threw parties on the roof!&lt;/p&gt;
&lt;/blockquote&gt;</description></item></channel>
</rss>

@ -0,0 +1,7 @@
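module TestFiles
  # Read every fixture under files/ into a constant named after the
  # upper-cased basename of the file (a fixture named basic.* becomes BASIC).
  Dir.chdir(File.dirname(__FILE__)) do
    Dir['files/*.{html,xhtml,xml}'].each do |fname|
      const_set fname[%r!/(\w+)\.\w+$!, 1].upcase, IO.read(fname)
    end
  end
end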

@ -0,0 +1,65 @@
#!/usr/bin/env ruby
require 'test/unit'
require 'hpricot'
require 'load_files'
class TestAlter < Test::Unit::TestCase
  def setup
    @basic = Hpricot.parse(TestFiles::BASIC)
  end

  def test_before
    test0 = "<link rel='stylesheet' href='test0.css' />"
    @basic.at("link").before(test0)
    assert_equal 'test0.css', @basic.at("link").attributes['href']
  end

  def test_after
    test_inf = "<link rel='stylesheet' href='test_inf.css' />"
    @basic.search("link")[-1].after(test_inf)
    assert_equal 'test_inf.css', @basic.search("link")[-1].attributes['href']
  end

  def test_wrap
    ohmy = (@basic/"p.ohmy").wrap("<div id='wrapper'></div>")
    assert_equal 'wrapper', ohmy[0].parent['id']
    assert_equal 'ohmy', Hpricot(@basic.to_html).at("#wrapper").children[0]['class']
  end

  def test_add_class
    first_p = (@basic/"p:first").add_class("testing123")
    assert first_p[0].get_attribute("class").split(" ").include?("testing123")
    assert((Hpricot(@basic.to_html)/"p:first")[0].attributes["class"].split(" ").include?("testing123"))
    assert !(Hpricot(@basic.to_html)/"p:gt(0)")[0].attributes["class"].split(" ").include?("testing123")
  end

  def test_change_attributes
    all_ps = (@basic/"p").attr("title", "Some Title")
    all_as = (@basic/"a").attr("href", "http://my_new_href.com")
    all_lb = (@basic/"link").attr("href") { |e| e.name }
    assert_changed(@basic, "p", all_ps) {|p| p.attributes["title"] == "Some Title"}
    assert_changed(@basic, "a", all_as) {|a| a.attributes["href"] == "http://my_new_href.com"}
    assert_changed(@basic, "link", all_lb) {|a| a.attributes["href"] == "link" }
  end

  def test_remove_attr
    all_rl = (@basic/"link").remove_attr("href")
    assert_changed(@basic, "link", all_rl) { |link| link['href'].nil? }
  end

  def test_remove_class
    all_c1 = (@basic/"p[@class*='last']").remove_class("last")
    assert_changed(@basic, "p[@class*='last']", all_c1) { |p| p['class'] == 'final' }
  end

  def test_remove_all_classes
    all_c2 = (@basic/"p[@class]").remove_class
    assert_changed(@basic, "p[@class]", all_c2) { |p| p['class'].nil? }
  end

  def assert_changed original, selector, set, &block
    assert set.all?(&block)
    assert Hpricot(original.to_html).search(selector).all?(&block)
  end
end

@ -0,0 +1,24 @@
#!/usr/bin/env ruby
require 'test/unit'
require 'hpricot'
class TestBuilder < Test::Unit::TestCase
  def test_escaping_text
    doc = Hpricot() { b "<a\"b>" }
    assert_equal "<b>&lt;a&quot;b&gt;</b>", doc.to_html
    assert_equal %{<a"b>}, doc.at("text()").to_s
  end

  def test_no_escaping_text
    doc = Hpricot() { div.test.me! { text "<a\"b>" } }
    assert_equal %{<div class="test" id="me"><a"b></div>}, doc.to_html
    assert_equal %{<a"b>}, doc.at("text()").to_s
  end

  def test_latin1_entities
    doc = Hpricot() { b "\200\225" }
    assert_equal "<b>&#8364;&#8226;</b>", doc.to_html
    assert_equal "\342\202\254\342\200\242", doc.at("text()").to_s
  end
end

@ -0,0 +1,379 @@
#!/usr/bin/env ruby
require 'test/unit'
require 'hpricot'
require 'load_files'
class TestParser < Test::Unit::TestCase
  def test_set_attr
    @basic = Hpricot.parse(TestFiles::BASIC)
    @basic.search('//p').set('class', 'para')
    assert_equal 4, @basic.search('//p').length
    assert_equal 4, @basic.search('//p').find_all { |x| x['class'] == 'para' }.length
  end

  # Test creating a new element
  def test_new_element
    elem = Hpricot::Elem.new(Hpricot::STag.new('form'))
    assert_not_nil(elem)
    assert_not_nil(elem.attributes)
  end

  def test_scan_text
    assert_equal 'FOO', Hpricot.make("FOO").first.content
  end

  def test_filter_by_attr
    @boingboing = Hpricot.parse(TestFiles::BOINGBOING)
    # this link is escaped in the doc
    link = 'http://www.youtube.com/watch?v=TvSNXyNw26g&search=chris%20ware'
    assert_equal link, @boingboing.at("a[@href='#{link}']")['href']
  end

  def test_filter_contains
    @basic = Hpricot.parse(TestFiles::BASIC)
    assert_equal '<title>Sample XHTML</title>', @basic.search("title:contains('Sample')").to_s
  end

  def test_get_element_by_id
    @basic = Hpricot.parse(TestFiles::BASIC)
    assert_equal 'link1', @basic.get_element_by_id('link1')['id']
    assert_equal 'link1', @basic.get_element_by_id('body1').get_element_by_id('link1').get_attribute('id')
  end

  def test_get_element_by_tag_name
    @basic = Hpricot.parse(TestFiles::BASIC)
    assert_equal 'link1', @basic.get_elements_by_tag_name('a')[0].get_attribute('id')
    assert_equal 'link1', @basic.get_elements_by_tag_name('body')[0].get_element_by_id('link1').get_attribute('id')
  end

  def test_output_basic
    @basic = Hpricot.parse(TestFiles::BASIC)
    @basic2 = Hpricot.parse(@basic.inner_html)
    scan_basic @basic2
  end

  def test_scan_basic
    @basic = Hpricot.parse(TestFiles::BASIC)
    scan_basic @basic
  end

  def scan_basic doc
    assert_kind_of Hpricot::XMLDecl, doc.children.first
    assert_not_equal doc.children.first.to_s, doc.children[1].to_s
    assert_equal 'link1', doc.at('#link1')['id']
    assert_equal 'link1', doc.at("p a")['id']
    assert_equal 'link1', (doc/:p/:a).first['id']
    assert_equal 'link1', doc.search('p').at('a').get_attribute('id')
    assert_equal 'link2', (doc/'p').filter('.ohmy').search('a').first.get_attribute('id')
    assert_equal((doc/'p')[2], (doc/'p').filter(':nth(2)')[0])
    assert_equal((doc/'p')[2], (doc/'p').filter('[3]')[0])
    assert_equal 4, (doc/'p').filter('*').length
    assert_equal 4, (doc/'p').filter('* *').length
    eles = (doc/'p').filter('.ohmy')
    assert_equal 1, eles.length
    assert_equal 'ohmy', eles.first.get_attribute('class')
    assert_equal 3, (doc/'p:not(.ohmy)').length
    assert_equal 3, (doc/'p').not('.ohmy').length
    assert_equal 3, (doc/'p').not(eles.first).length
    assert_equal 2, (doc/'p').filter('[@class]').length
    assert_equal 'last final', (doc/'p[@class~="final"]').first.get_attribute('class')
    assert_equal 1, (doc/'p').filter('[@class~="final"]').length
    assert_equal 2, (doc/'p > a').length
    assert_equal 1, (doc/'p.ohmy > a').length
    assert_equal 2, (doc/'p / a').length
    assert_equal 2, (doc/'link ~ link').length
    assert_equal 3, (doc/'title ~ link').length
    assert_equal 5, (doc/"//p/text()").length
    assert_equal 6, (doc/"//p[a]//text()").length
    assert_equal 2, (doc/"//p/a/text()").length
  end

  def test_positional
    h = Hpricot( "<div><br/><p>one</p><p>two</p></div>" )
    assert_equal "<p>one</p>", h.search("//div/p:eq(0)").to_s
    assert_equal "<p>one</p>", h.search("//div/p:first").to_s
    assert_equal "<p>one</p>", h.search("//div/p:first()").to_s
  end

  def test_pace
    doc = Hpricot(TestFiles::PACE_APPLICATION)
    assert_equal 'get', doc.at('form[@name=frmSect11]')['method']
    # assert_equal '2', doc.at('#hdnSpouse')['value']
  end

  def test_scan_boingboing
    @boingboing = Hpricot.parse(TestFiles::BOINGBOING)
    assert_equal 60, (@boingboing/'p.posted').length
    assert_equal 1, @boingboing.search("//a[@name='027906']").length
    assert_equal 10, @boingboing.search("script comment()").length
    assert_equal 3, @boingboing.search("a[text()*='Boing']").length
    assert_equal 1, @boingboing.search("h3[text()='College kids reportedly taking more smart drugs']").length
    assert_equal 0, @boingboing.search("h3[text()='College']").length
    assert_equal 60, @boingboing.search("h3").length
    assert_equal 59, @boingboing.search("h3[text()!='College kids reportedly taking more smart drugs']").length
    assert_equal 17, @boingboing.search("h3[text()$='s']").length
    assert_equal 129, @boingboing.search("p[text()]").length
    assert_equal 211, @boingboing.search("p").length
  end

  def test_reparent
    doc = Hpricot(%{<div id="blurb_1"></div>})
    div1 = doc.search('#blurb_1')
    div1.before('<div id="blurb_0"></div>')
    div0 = doc.search('#blurb_0')
    div0.before('<div id="blurb_a"></div>')
    assert_equal 'div', doc.at('#blurb_1').name
  end

  def test_siblings
    @basic = Hpricot.parse(TestFiles::BASIC)
    t = @basic.at(:title)
    e = t.next_sibling
    assert_equal 'test1.css', e['href']
    assert_equal 'title', e.previous_sibling.name
  end

  def test_css_negation
    @basic = Hpricot.parse(TestFiles::BASIC)
    assert_equal 3, (@basic/'p:not(.final)').length
  end

  def test_remove_attribute
    @basic = Hpricot.parse(TestFiles::BASIC)
    (@basic/:p).each { |ele| ele.remove_attribute('class') }
    assert_equal 0, (@basic/'p[@class]').length
  end

  def test_abs_xpath
    @boingboing = Hpricot.parse(TestFiles::BOINGBOING)
    assert_equal 60, @boingboing.search("/html/body//p[@class='posted']").length
    assert_equal 60, @boingboing.search("/*/body//p[@class='posted']").length
    assert_equal 18, @boingboing.search("//script").length
    divs = @boingboing.search("//script/../div")
    assert_equal 1, divs.length
    imgs = @boingboing.search('//div/p/a/img')
    assert_equal 15, imgs.length
    assert_equal 17, @boingboing.search('//div').search('p/a/img').length
    assert imgs.all? { |x| x.name == 'img' }
  end

  def test_predicates
    @boingboing = Hpricot.parse(TestFiles::BOINGBOING)
    assert_equal 2, @boingboing.search('//link[@rel="alternate"]').length
    p_imgs = @boingboing.search('//div/p[/a/img]')
    assert_equal 15, p_imgs.length
    assert p_imgs.all? { |x| x.name == 'p' }
    p_imgs = @boingboing.search('//div/p[a/img]')
    assert_equal 18, p_imgs.length
    assert p_imgs.all? { |x| x.name == 'p' }
    assert_equal 1, @boingboing.search('//input[@checked]').length
  end

  def test_tag_case
    @tenderlove = Hpricot.parse(TestFiles::TENDERLOVE)
    assert_equal 2, @tenderlove.search('//a').length
    assert_equal 3, @tenderlove.search('//area').length
    assert_equal 2, @tenderlove.search('//meta').length
  end

  def test_alt_predicates
    @boingboing = Hpricot.parse(TestFiles::BOINGBOING)
    assert_equal 1, @boingboing.search('//table/tr:last').length

    @basic = Hpricot.parse(TestFiles::BASIC)
    assert_equal "<p>The third paragraph</p>",
      @basic.search('p:eq(2)').to_html
    assert_equal '<p class="last final"><b>THE FINAL PARAGRAPH</b></p>',
      @basic.search('p:last').to_html
    assert_equal 'last final', @basic.search('//p:last-of-type').first.get_attribute('class')
  end

  def test_insert_after # ticket #63
    doc = Hpricot('<html><body><div id="a-div"></div></body></html>')
    (doc/'div').each do |element|
      element.after('<p>Paragraph 1</p><p>Paragraph 2</p>')
    end
    assert_equal doc.to_html, '<html><body><div id="a-div"></div><p>Paragraph 1</p><p>Paragraph 2</p></body></html>'
  end

  def test_insert_before # ticket #61
    doc = Hpricot('<html><body><div id="a-div"></div></body></html>')
    (doc/'div').each do |element|
      element.before('<p>Paragraph 1</p><p>Paragraph 2</p>')
    end
    assert_equal doc.to_html, '<html><body><p>Paragraph 1</p><p>Paragraph 2</p><div id="a-div"></div></body></html>'
  end

  def test_many_paths
    @boingboing = Hpricot.parse(TestFiles::BOINGBOING)
    assert_equal 62, @boingboing.search('p.posted, link[@rel="alternate"]').length
    assert_equal 20, @boingboing.search('//div/p[a/img]|//link[@rel="alternate"]').length
  end

  def test_stacked_search
    @boingboing = Hpricot.parse(TestFiles::BOINGBOING)
    assert_kind_of Hpricot::Elements, @boingboing.search('//div/p').search('a img')
  end

  def test_class_search
    # test case sent by Chih-Chao Lam
    doc = Hpricot("<div class=xyz'>abc</div>")
    assert_equal 1, doc.search(".xyz").length
    doc = Hpricot("<div class=xyz>abc</div><div class=abc>xyz</div>")
    assert_equal 1, doc.search(".xyz").length
    assert_equal 4, doc.search("*").length
  end

  def test_kleene_star
    # bug noticed by raja bhatia
    doc = Hpricot("<span class='small'>1</span><div class='large'>2</div><div class='small'>3</div><span class='blue large'>4</span>")
    assert_equal 2, doc.search("*[@class*='small']").length
    assert_equal 2, doc.search("*.small").length
    assert_equal 2, doc.search(".small").length
    assert_equal 2, doc.search(".large").length
  end

  def test_empty_comment
    doc = Hpricot("<p><!----></p>")
    assert doc.children[0].children[0].comment?
    doc = Hpricot("<p><!-- --></p>")
    assert doc.children[0].children[0].comment?
  end

  def test_body_newlines
    @immob = Hpricot.parse(TestFiles::IMMOB)
    body = @immob.at(:body)
    {'background' => '', 'bgcolor' => '#ffffff', 'text' => '#000000', 'marginheight' => '10',
     'marginwidth' => '10', 'leftmargin' => '10', 'topmargin' => '10', 'link' => '#000066',
     'alink' => '#ff6600', 'hlink' => "#ff6600", 'vlink' => "#000000"}.each do |k, v|
      assert_equal v, body[k]
    end
  end

  def test_nested_twins
    @doc = Hpricot("<div>Hi<div>there</div></div>")
    assert_equal 1, (@doc/"div div").length
  end

  def test_wildcard
    @basic = Hpricot.parse(TestFiles::BASIC)
    assert_equal 3, (@basic/"*[@id]").length
    assert_equal 3, (@basic/"//*[@id]").length
  end

  def test_javascripts
    @immob = Hpricot.parse(TestFiles::IMMOB)
    assert_equal 3, (@immob/:script)[0].inner_html.scan(/<LINK/).length
  end

  def test_nested_scripts
    @week9 = Hpricot.parse(TestFiles::WEEK9)
    assert_equal 14, (@week9/"a").find_all { |x| x.inner_html.include? "GameCenter" }.length
  end

  def test_uswebgen
    @uswebgen = Hpricot.parse(TestFiles::USWEBGEN)
    # sent by brent beardsley, hpricot 0.3 had problems with all the links.
    assert_equal 67, (@uswebgen/:a).length
  end

  def test_mangled_tags
    [%{<html><form name='loginForm' method='post' action='/units/a/login/1,13088,779-1,00.html'?URL=></form></html>},
     %{<html><form name='loginForm' ?URL= method='post' action='/units/a/login/1,13088,779-1,00.html'></form></html>},
     %{<html><form name='loginForm'?URL= ?URL= method='post' action='/units/a/login/1,13088,779-1,00.html'?URL=></form></html>},
     %{<html><form name='loginForm' method='post' action='/units/a/login/1,13088,779-1,00.html' ?URL=></form></html>}].
    each do |str|
      doc = Hpricot(str)
      assert_equal 1, (doc/:form).length
      assert_equal '/units/a/login/1,13088,779-1,00.html', doc.at("form")['action']
    end
  end

  def test_procins
    doc = Hpricot("<?php print('hello') ?>\n<?xml blah='blah'?>")
    assert_equal "php", doc.children[0].target
    assert_equal "blah='blah'", doc.children[2].content
  end

  def test_buffer_error
    assert_raise Hpricot::ParseError, "ran out of buffer space on element <input>, starting on line 3." do
      Hpricot(%{<p>\n\n<input type="hidden" name="__VIEWSTATE" value="#{(("X" * 2000) + "\n") * 22}" />\n\n</p>})
    end
  end
def test_youtube_attr
str = <<-edoc
<html><body>
Lorem ipsum. Jolly roger, ding-dong sing-a-long
<object width="425" height="350">
<param name="movie" value="http://www.youtube.com/v/NbDQ4M_cuwA"></param>
<param name="wmode" value="transparent"></param>
<embed src="http://www.youtube.com/v/NbDQ4M_cuwA"
type="application/x-shockwave-flash" wmode="transparent" width="425" height="350">
</embed>
</object>
Check out my posting, I have bright mice in large clown cars.
<object width="425" height="350">
<param name="movie" value="http://www.youtube.com/v/foobar"></param>
<param name="wmode" value="transparent"></param>
<embed src="http://www.youtube.com/v/foobar"
type="application/x-shockwave-flash" wmode="transparent" width="425" height="350">
</embed>
</object>
</body></html?
edoc
doc = Hpricot(str)
assert_equal "http://www.youtube.com/v/NbDQ4M_cuwA",
doc.at("//object/param[@value='http://www.youtube.com/v/NbDQ4M_cuwA']")['value']
end
# ticket #84 by jamezilla
def test_screwed_xmlns
doc = Hpricot(<<-edoc)
<?xml:namespace prefix = cwi />
<html><body>HAI</body></html>
edoc
assert_equal "HAI", doc.at("body").inner_text
end
# Reported by Jonathan Nichols on the Hpricot list (24 May 2007)
def test_self_closed_form
doc = Hpricot(<<-edoc)
<body>
<form action="/loginRegForm" name="regForm" method="POST" />
<input type="button">
</form>
</body>
edoc
assert_equal "button", doc.at("//form/input")['type']
end

  def test_filters
    @basic = Hpricot.parse(TestFiles::BASIC)
    assert_equal 0, (@basic/"title:parent").size
    assert_equal 3, (@basic/"p:parent").size
    assert_equal 1, (@basic/"title:empty").size
    assert_equal 1, (@basic/"p:empty").size
  end
def test_keep_cdata
str = %{<script> /*<![CDATA[*/
/*]]>*/ </script>}
assert_equal str, Hpricot(str).to_html
end
def test_namespace
chunk = <<-END
<a xmlns:t="http://www.nexopia.com/dev/template">
<t:sam>hi </t:sam>
</a>
END
doc = Hpricot::XML(chunk)
assert((doc/"//t:sam").size > 0) # at least this should probably work
# assert (doc/"//sam").size > 0 # this would be nice
end
end

@ -0,0 +1,16 @@
#!/usr/bin/env ruby
require 'test/unit'
require 'hpricot'
require 'load_files'
class TestParser < Test::Unit::TestCase
  def test_roundtrip
    @basic = Hpricot.parse(TestFiles::BASIC)
    %w[link link[2] body #link1 a p.ohmy].each do |css_sel|
      ele = @basic.at(css_sel)
      assert_equal ele, @basic.at(ele.css_path)
      assert_equal ele, @basic.at(ele.xpath)
    end
  end
end

@ -0,0 +1,66 @@
#!/usr/bin/env ruby
require 'test/unit'
require 'hpricot'
require 'load_files'
class TestPreserved < Test::Unit::TestCase
  def assert_roundtrip str
    doc = Hpricot(str)
    yield doc if block_given?
    str2 = doc.to_original_html
    [*str].zip([*str2]).each do |s1, s2|
      assert_equal s1, s2
    end
  end

  def assert_html str1, str2
    doc = Hpricot(str2)
    yield doc if block_given?
    assert_equal str1, doc.to_original_html
  end

  def test_simple
    str = "<p>Hpricot is a <b>you know <i>uh</b> fine thing.</p>"
    assert_html str, str
    assert_html "<p class=\"new\">Hpricot is a <b>you know <i>uh</b> fine thing.</p>", str do |doc|
      (doc/:p).set('class', 'new')
    end
  end

  def test_parent
    str = "<html><base href='/'><head><title>Test</title></head><body><div id='wrap'><p>Paragraph one.</p><p>Paragraph two.</p></div></body></html>"
    assert_html str, str
    assert_html "<html><base href='/'><body><div id=\"all\"><div><p>Paragraph one.</p></div><div><p>Paragraph two.</p></div></div></body></html>", str do |doc|
      (doc/:head).remove
      (doc/:div).set('id', 'all')
      (doc/:p).wrap('<div></div>')
    end
  end

  def test_escaping_of_contents
    doc = Hpricot(TestFiles::BOINGBOING)
    assert_equal "Fukuda\342\200\231s Automatic Door opens around your body as you pass through it. The idea is to save energy and keep the room clean.", doc.at("img[@alt='200606131240']").next.to_s.strip
  end

  def test_files
    assert_roundtrip TestFiles::BASIC
    assert_roundtrip TestFiles::BOINGBOING
    assert_roundtrip TestFiles::CY0
  end

  def test_escaping_of_attrs
    # ampersands in URLs
    str = %{<a href="http://google.com/search?q=hpricot&amp;l=en">Google</a>}
    link = (doc = Hpricot(str)).at(:a)
    assert_equal "http://google.com/search?q=hpricot&l=en", link['href']
    assert_equal "http://google.com/search?q=hpricot&l=en", link.attributes['href']
    assert_equal "http://google.com/search?q=hpricot&l=en", link.get_attribute('href')
    assert_equal "http://google.com/search?q=hpricot&amp;l=en", link.raw_attributes['href']
    assert_equal str, doc.to_html

    # alter the url
    link['href'] = "javascript:alert(\"AGGA-KA-BOO!\")"
    assert_equal %{<a href="javascript:alert(&quot;AGGA-KA-BOO!&quot;)">Google</a>}, doc.to_html
  end
end

@ -0,0 +1,28 @@
#!/usr/bin/env ruby
require 'test/unit'
require 'hpricot'
require 'load_files'
class TestParser < Test::Unit::TestCase
  # normally, the link tags are empty HTML tags.
  # contributed by laudney.
  def test_normally_empty
    doc = Hpricot::XML("<rss><channel><title>this is title</title><link>http://fake.com</link></channel></rss>")
    assert_equal "this is title", (doc/:rss/:channel/:title).text
    assert_equal "http://fake.com", (doc/:rss/:channel/:link).text
  end

  # make sure XML doesn't get downcased
  def test_casing
    doc = Hpricot::XML(TestFiles::WHY)
    assert_equal "hourly", doc.at("sy:updatePeriod").inner_html
    assert_equal 1, (doc/"guid[@isPermaLink]").length
  end

  # be sure tags named "text" are ok
  def test_text_tags
    doc = Hpricot::XML("<feed><title>City Poisoned</title><text>Rita Lee has poisoned Brazil.</text></feed>")
    assert_equal "City Poisoned", (doc/"title").text
  end
end