Creating the Keyword Extractor classes

The Keyword Extractor interface defines three abstract methods. They are self-explanatory: a Keyword Extractor class must be able to open a file, extract keywords (into a string), and subsequently close the file.

public interface IKeywordExtractor {
    bool Open(string filePath);
    string ExtractKeywords();
    bool Close();
}
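To make the contract concrete, here is a minimal sketch of one possible implementation for plain text files. The class name PlainTextKeywordExtractor is hypothetical and does not come from the original design. The sketch also assumes that the bool return values mean "the operation succeeded", which the interface itself does not state. The interface is repeated so that the example compiles on its own.

```csharp
using System;
using System.IO;

public interface IKeywordExtractor {
    bool Open(string filePath);
    string ExtractKeywords();
    bool Close();
}

// Hypothetical implementation for plain text files (illustration only).
// Assumption: a return value of true means the operation succeeded.
public class PlainTextKeywordExtractor : IKeywordExtractor {
    private StreamReader _reader;

    public bool Open(string filePath) {
        // Report failure instead of throwing when the file is missing.
        if (!File.Exists(filePath)) return false;
        _reader = new StreamReader(filePath);
        return true;
    }

    public string ExtractKeywords() {
        // Without an open file there is nothing to extract.
        if (_reader == null) return string.Empty;
        // A plain text file is already text, so return its contents as-is.
        return _reader.ReadToEnd();
    }

    public bool Close() {
        // Closing a file that was never opened is reported as a failure.
        if (_reader == null) return false;
        _reader.Dispose();
        _reader = null;
        return true;
    }
}
```

A caller would check the result of Open before calling ExtractKeywords, and call Close when done. Extractors for other file formats would follow the same three-step shape, with format-specific parsing inside ExtractKeywords.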

We also create a KeywordExtractorBase class that offers functionality common to all keyword extractors. When you extract keywords from a file, you will most likely want to discard common words that do not need to be indexed, such as 'a', 'the', and 'an'. These words are called stop words. We can strip them out with a regular expression, as shown in the RemoveStopWords function that follows:

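using System.Text.RegularExpressions;

public class KeywordExtractorBase {
    private string[] _stopWords;

    public string RemoveStopWords(string RawText) {
        string _regexPattern;
        int _counter;
        // Build the patterns in a copy so repeated calls do not keep
        // prepending and appending \b to the stored stop words.
        string[] _patterns = new string[_stopWords.Length];
        for (_counter = 0; _counter <= (_stopWords.Length - 1); _counter++)
            _patterns[_counter] = "\\b" + _stopWords[_counter] + "\\b";
        // Join the alternatives into one pattern: (\ba\b|\band\b|...)
        _regexPattern = "(" + string.Join("|", _patterns) + ")";
        return Regex.Replace(RawText, _regexPattern, "", RegexOptions.IgnoreCase);
    }

    public KeywordExtractorBase() {
        _stopWords = new string[] {"a", "and", "the", "of", "but"};
    }
}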
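One side effect of replacing stop words with an empty string is that the spaces around each removed word stay behind. The following is a minimal sketch of the same word-boundary technique followed by a whitespace cleanup step. The StopWordFilter class and its Clean method are hypothetical names used only for illustration, and the Regex.Escape call is an added precaution that the original code does not include.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original classes.
public static class StopWordFilter {
    private static readonly string[] StopWords = { "a", "and", "the", "of", "but" };

    public static string Clean(string rawText) {
        // Escape each word (added precaution) and anchor it on word boundaries
        // so that "a" does not match inside a longer word such as "cat".
        string pattern = "(" + string.Join("|",
            StopWords.Select(w => "\\b" + Regex.Escape(w) + "\\b")) + ")";
        string stripped = Regex.Replace(rawText, pattern, "", RegexOptions.IgnoreCase);
        // Collapse the gaps left by removed words into single spaces.
        return Regex.Replace(stripped, "\\s+", " ").Trim();
    }
}
```

For example, StopWordFilter.Clean("The cat and the dog") returns "cat dog". Without the final cleanup step the result would keep a leading space and three spaces between the two remaining words.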
