org.apache.lucene.wikipedia.analysis
Class WikipediaTokenizer
java.lang.Object
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.wikipedia.analysis.WikipediaTokenizer
public class WikipediaTokenizer extends Tokenizer
Extension of StandardTokenizer that is aware of Wikipedia syntax. It is based on the
Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not be complete.
EXPERIMENTAL
NOTE: This Tokenizer is considered experimental, and the grammar is subject to change in the trunk and in follow-up releases.
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Method Summary
Token next(Token result)
Returns the next token in the stream, or null at EOS.
void reset()
Resets this stream to the beginning.
void reset(Reader reader)
Expert: Reset the tokenizer to a new reader.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
INTERNAL_LINK
public static final String INTERNAL_LINK
- See Also:
- Constant Field Values
EXTERNAL_LINK
public static final String EXTERNAL_LINK
- See Also:
- Constant Field Values
EXTERNAL_LINK_URL
public static final String EXTERNAL_LINK_URL
- See Also:
- Constant Field Values
CITATION
public static final String CITATION
- See Also:
- Constant Field Values
CATEGORY
public static final String CATEGORY
- See Also:
- Constant Field Values
BOLD
public static final String BOLD
- See Also:
- Constant Field Values
ITALICS
public static final String ITALICS
- See Also:
- Constant Field Values
BOLD_ITALICS
public static final String BOLD_ITALICS
- See Also:
- Constant Field Values
HEADING
public static final String HEADING
- See Also:
- Constant Field Values
SUB_HEADING
public static final String SUB_HEADING
- See Also:
- Constant Field Values
WikipediaTokenizer
public WikipediaTokenizer(Reader input)
- Creates a new instance of the WikipediaTokenizer. Attaches the input to a newly created JFlex scanner.
- Parameters:
input - The Input Reader
next
public Token next(Token result)
throws IOException
- Description copied from class: TokenStream
- Returns the next token in the stream, or null at EOS. When possible, the input Token should be used as the returned Token (this gives the fastest tokenization performance), but this is not required and a new Token may be returned. Callers may re-use a single Token instance for successive calls to this method.
This implicitly defines a "contract" between consumers (callers of this method) and producers (implementations of this method that are the source for tokens):
- A consumer must fully consume the previously returned Token before calling this method again.
- A producer must call Token.clear() before setting the fields in it and returning it.
Note that a TokenFilter is considered a consumer.
- Overrides: next in class TokenStream
- Parameters:
result - a Token that may or may not be used to return
- Returns:
- next token in the stream or null if end-of-stream was hit
- Throws:
IOException
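The consumer/producer contract described for next(Token) can be illustrated with a small, self-contained sketch. SketchToken and SketchProducer below are hypothetical stand-ins, not Lucene classes; they only mimic the shape of Token and of a next(Token result) implementation so that the example runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class TokenContractSketch {

    // Hypothetical stand-in for Token: only a term and a type.
    static final class SketchToken {
        String term;
        String type;

        // Plays the role of Token.clear(): wipe state left from the previous call.
        void clear() {
            term = null;
            type = null;
        }
    }

    // Hypothetical producer that emits a fixed list of terms.
    static final class SketchProducer {
        private final Iterator<String> terms;

        SketchProducer(List<String> terms) {
            this.terms = terms.iterator();
        }

        // Returns the next token, or null at end of stream. Reuses 'result'.
        SketchToken next(SketchToken result) {
            if (!terms.hasNext()) {
                return null;
            }
            result.clear();              // producer obligation: clear before setting fields
            result.term = terms.next();
            result.type = "word";        // assumed type label, for illustration only
            return result;
        }
    }

    // Consumer: copies what it needs out of each token before requesting the next one.
    static List<String> consume(SketchProducer producer) {
        List<String> seen = new ArrayList<>();
        SketchToken reusable = new SketchToken();
        for (SketchToken t = producer.next(reusable); t != null; t = producer.next(reusable)) {
            seen.add(t.term + "/" + t.type);   // fully consume before the next call
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SketchProducer(Arrays.asList("wiki", "markup"))));
    }
}
```

Because the same SketchToken instance is handed back on every call, a consumer that stored the token reference itself, instead of copying the term and type out of it, would see every stored entry change on the next call. That is the reason the contract asks consumers to finish with a token before calling next again.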
reset
public void reset()
throws IOException
- Description copied from class: TokenStream
- Resets this stream to the beginning. This is an optional operation, so subclasses may or may not implement this method. reset() is not needed for the standard indexing process. However, if the Tokens of a TokenStream are intended to be consumed more than once, it is necessary to implement reset().
- Overrides: reset in class TokenStream
- Throws:
IOException
reset
public void reset(Reader reader)
throws IOException
- Description copied from class: Tokenizer
- Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.
- Overrides: reset in class Tokenizer
- Throws:
IOException
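The re-use pattern that reset(Reader) enables can be sketched as follows. WhitespaceSketchTokenizer is a hypothetical stand-in that splits on whitespace; it is not the WikipediaTokenizer and applies no Wikipedia grammar. It only shows one tokenizer instance being pointed at successive readers instead of being re-created for each document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ReaderResetSketch {

    // Hypothetical stand-in for a Tokenizer: splits its input on whitespace.
    static final class WhitespaceSketchTokenizer {
        private Reader input;

        WhitespaceSketchTokenizer(Reader input) {
            this.input = input;
        }

        // Plays the role of reset(Reader): point the same instance at new input.
        void reset(Reader reader) {
            this.input = reader;
        }

        // Returns the next whitespace-delimited term, or null at end of stream.
        String next() throws IOException {
            StringBuilder term = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace((char) c)) {
                    if (term.length() > 0) {
                        return term.toString();
                    }
                } else {
                    term.append((char) c);
                }
            }
            return term.length() > 0 ? term.toString() : null;
        }
    }

    // Tokenizes several documents with a single tokenizer instance.
    static List<String> tokenizeAll(String... docs) {
        List<String> terms = new ArrayList<>();
        WhitespaceSketchTokenizer tokenizer = null;
        try {
            for (String doc : docs) {
                Reader reader = new StringReader(doc);
                if (tokenizer == null) {
                    tokenizer = new WhitespaceSketchTokenizer(reader); // first document
                } else {
                    tokenizer.reset(reader);                           // later documents re-use it
                }
                for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
                    terms.add(t);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll("first page", "second  page"));
    }
}
```

In an analyzer, the tokenizer instance would typically be cached between calls, so that only the first document pays the construction cost.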
Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.