Happy New 2012 Year!!!


Best of luck in new Year!

Posted in Posts | Leave a comment

C++ 11 regex based tokenizer

Tokenization is quite common task in many software applications. Tokenization is a process of splitting text into tokens. For example, text: “This is an example”. Might be split into following tokens: “this” “is” “an” “example”.

Recently I saw nice video from Stephan T. Lavavej where he presents regular expressions from C++’s STL library. In this video he mention an interesting use case, where regular expressions might be used to create tokenizers. Usually when writing tokenizers one would express rules for matching tokens, while in the proposed approach you are expected to write rules for matching token separators. Piece of the text that does not match regex for separator is your token.

I have tried to write something more complicated than just splitting text using whitespace characters as separators. Tokenization then became a two phase process:

  • At the first stage, tokenizer will split input text using whitespaces.
  • At the second stage, tokens will be matched against a set of regular expressions. This set contains regular expressions for matching:
    • numbers,
    • urls,
    • e-mail addresses,
    • dates.

    Once text from the first stage match one of the given regexes it will be converted into token of specific type. Otherwise the text will be split into sub-tokens using punctuation characters.

Here is an example. The text: “My birthday is 20/01/1984″, will be first split into following tokens:

“My” “birthday” “is” “20/01/1984″

Then each token will be matched against more restrictive regular expressions. “20/01/1984″ will be recognized as date.

As mentioned in the video there are specializations of regex type for string types: string and wstring. However, I wanted my tokenizer to work well with files. According to standard regex require bidirectional iterators. So, I’ve decided to write very simple implementation of a custom bidirectional file based iterator. It provides buffered access to ASCII text files and might be found in attached solution.

Downloads: Regex tokenizer C++ solution.

Final thoughts. Even though presented approach looks promising, regular expressions are very powerful in capturing various patterns, implementation appears to perform slow comparing to handwritten tokenizer. So, it is only practical for parsing small files. On my Dell Studio 1555 laptop (Intel Core 2 Duo P8700 2.53GHz with 4GB of RAM and x64 Windows 7) the speed of tokenization is only about 400-800 Kilobytes per second comparing to handwritten tokenizer reaching about 5-10 megabytes per second.

In my example I’ve used some regular epxressions from the following website: http://regexlib.com.

A good introduction to TR1 C++ regular expressions might be found here.

Posted in C++ Blacksmith | Leave a comment

Measure 9 minutes with 7 and 4 minutes hourglasses

Today, I’ve read a blog about this year oddest job interview questions. Among really odd questions there is actually an interesting one that gave me today about 5 minutes of pleasant thinking and about half of hour drawing the solution which am about to share.

Here is a puzzle: Using only a four-minute hourglass and a seven-minute hourglass, measure exactly nine minutes–without the process taking longer than nine minutes

Here is my solution:

Posted in Puzzles | Leave a comment

Approximation of circle with rectangles

Quad tree is an interesting data structure. It is used for solving various geometrical problems, such as fast collision detection, indexing of geospacial data, etc. I am going to show how it might be used for approximation of different 2D shapes. (There is also 3D version of quad trees, called octal tree).

Let’s start with a final result of the algorithm:

Small white rectangles approximate circle’s border. Red rectangles form an inner area of the circle. Yellow does not belong to the circle and might be removed.

Building quad tree is simple. We start from the bounding rectangle. Then it will be split into four smaller rectangles of the equal size. Let’s say, input rectangle is defined by following points (x1,y1) -> (x2,y2). Then rules for splitting it into four smaller rectangles are following:

  • (x1, y1) -> (x2-width/2,y2-height/2)
  • (x1, y1+height/2) -> (x1+width/2,y2)
  • (x2-width/2, y2-height/2) -> (x2,y2)
  • (x2-width/2, y1) -> (x2,y2-height/2)

This procedure might be recursively applied to all leafs in the tree until new leaf rectangles reach certain size.

In order to make a decision if given leaf rectangle should be split, we need a function f, which for a given rectangle r will return how many of its verteces belong to the figure. There are obviously four cases:

  • f(r) = 4 – Rectangle belongs to the inner area of the figure. No further split required. (Such rectangles are red)
  • f(r) = 3 or 2 – These rectangles touch the border of the figure and about half of the rectangle belongs to figure. (White rectangles)
  • f(r) = 1 – Less than half of the rectangle belongs to figure. (Gray rectangles)
  • f(r) = 0 – Rectangle lays outside the figure. No further split required. (Yellow rectangles)

For a circle, function f is quite simple:

  • f(x1,y1,x2,y2) = belongs(x1,y1)+belongs(x1,y2)+belongs(x2,y2)+belongs(x2,y2),

Where

  • belongs(x,y) = 1, if x*x+y*y< =r*r and 0 otherwise,
  • r – is the radius of the circle.

There is a C# solution that shows this basic algorithm: Download.

By modifying function f, one can adopt this algorithm for other 2D figures.

Posted in .NET Corner, Puzzles | 1 Comment

11:11:11 11.11.11

What a date! What a time! Celebrate!

Posted in Posts | Leave a comment