.NET Interview Questions - Part 3
December 02, 2007 • 7:56PM • permalink
I received such an overwhelming response to my last two blog posts on .NET interview questions that I decided to post a third.
Part 1 can be found here.
Part 2 can be found here.
Continuing from where we left off...
6. If placed in the Page_Load method of an ASP.NET page, what will the following code output?
Response.Write("<br />Before");
try
{
    Response.Write("<br />In the 'try'");
    int i = 0;
    int j = 1 / i;
}
catch
{
    Response.Write("<br />In the 'catch'");
    Response.End();
    return;
}
finally
{
    Response.Write("<br />In the 'finally'");
}
Response.Write("<br />After");
Pretty simple question, right? Wrong!
I got it wrong the first time round too and even for the posting of this blog I made sure to execute the program and check the results!
You would see the following:
Before
In the 'try'
In the 'catch'
In the 'finally'
Remember that the finally clause will execute without exception (no pun intended). I tried to really drive that home by first calling Response.End, which itself throws a second exception (a ThreadAbortException), and then adding a return statement in an attempt to leave the currently executing method.
Regardless of the return, the finally clause still executes before control leaves the method, and the word "After" is never displayed.
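The same ordering can be checked outside of ASP.NET. The following is only a minimal sketch, assuming a hypothetical FinallyDemo class that records each message in a list instead of calling Response.Write. It leaves out Response.End, so it only demonstrates the return-inside-catch half of the question.

```csharp
using System;
using System.Collections.Generic;

public static class FinallyDemo
{
    // Runs the try/catch/finally sequence and returns the recorded messages.
    public static List<string> Run()
    {
        List<string> log = new List<string>();
        Divide(log);
        return log;
    }

    private static void Divide(List<string> log)
    {
        log.Add("Before");
        try
        {
            log.Add("In the 'try'");
            int i = 0;
            int j = 1 / i;           // throws DivideByZeroException at run time
            log.Add(j.ToString());   // never reached
        }
        catch (DivideByZeroException)
        {
            log.Add("In the 'catch'");
            return;                  // the finally block still runs before we leave
        }
        finally
        {
            log.Add("In the 'finally'");
        }
        log.Add("After");            // skipped because of the return above
    }
}
```

Calling FinallyDemo.Run() should give back the same four entries as the page output above, with no "After" at the end.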
7. Write a script to generate a dynamic image on a webpage, such as for use as a CAPTCHA, placing a watermark on an image or checking the referring URL of a requested image.
For my example, I'll display 10 characters of randomly sized/styled/selected text on a red background. Note that I'm not going to introduce any warping, backgrounds or any other security features. This code is not intended for use as a real CAPTCHA, and it would be trivial to write an OCR script to attack it.
I'm going to post the whole block of code without too much discussion. Most of the work is done by the GDI functions, which you can easily look up on MSDN. This would be placed in the OnLoad portion of a page and then called through an img element in the HTML like:
<img src="CaptchaImage.aspx" />
Note that we have previously defined the following helper structure to avoid repeated boxing/unboxing:
struct CaptchaCharacter
{
    public char character;
    public Font font;
}
The rest of the code follows:
//requires the System.Drawing, System.Drawing.Drawing2D and System.Drawing.Imaging namespaces
int width = 600;
int height = 400;
int number_of_characters = 10;
string character_choices = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";
//NUMBERS 0+1, LETTERS I+O removed for legibility reasons
string[] font_families = { "Tahoma", "Arial", "Verdana" };
int[] font_sizes = { 36, 60, 84, 108 };
Rectangle bmp_rect = new Rectangle(0, 0, width, height);
Bitmap bmp = new Bitmap(width, height);
Graphics graphics = Graphics.FromImage(bmp);
graphics.SmoothingMode = SmoothingMode.AntiAlias;
graphics.FillRectangle(Brushes.Red, bmp_rect);
//pick a random character and a random font for each position
CaptchaCharacter[] character_array = new CaptchaCharacter[number_of_characters];
Random rnd = new Random();
for (int x = 0; x < number_of_characters; x++)
{
    CaptchaCharacter new_char = new CaptchaCharacter();
    new_char.character = character_choices[rnd.Next(0, character_choices.Length)];
    new_char.font = new Font(font_families[rnd.Next(0, font_families.Length)],
                             font_sizes[rnd.Next(0, font_sizes.Length)]);
    character_array[x] = new_char;
}
StringFormat format = new StringFormat();
format.Alignment = StringAlignment.Center;
format.LineAlignment = StringAlignment.Center;
//add each character to the path, centered in its own column of the image
GraphicsPath path = new GraphicsPath();
for (int a = 0; a < number_of_characters; a++)
{
    RectangleF rect = new RectangleF((width / number_of_characters) * a,
                                     0,
                                     width / number_of_characters,
                                     height);
    path.AddString(character_array[a].character.ToString(),
                   character_array[a].font.FontFamily,
                   0,
                   character_array[a].font.SizeInPoints,
                   rect,
                   format);
}
graphics.FillPath(Brushes.Black, path);
Response.ContentType = "image/gif";
bmp.Save(Response.OutputStream, ImageFormat.Gif);
//we dispose all the Graphics objects
for (int z = 0; z < character_array.Length; z++)
    character_array[z].font.Dispose();
format.Dispose();
path.Dispose();
graphics.Dispose();
bmp.Dispose();
First a Bitmap object is created, which is what we will eventually output. After obtaining a reference to its GDI Graphics object, we begin drawing on it. First a background rectangle is drawn with a Red brush and then a GraphicsPath object is created. We can use the built-in AddString method of the GraphicsPath to easily style and add our characters. We could have easily output the whole string at once, but we loop through each character to apply an individual FontFamily and font size to each one. Finally, we change the ContentType of our encapsulating page and save the bitmap to the built-in OutputStream (which will block all other output to the page).
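The two pieces of arithmetic in that loop, picking random characters and slicing the image into equal columns, can be tried without any GDI objects. This is only a sketch: the CaptchaLayout class and its method names are hypothetical, and only the character list and the width / number_of_characters formula come from the code above.

```csharp
using System;

public static class CaptchaLayout
{
    // Same character list as above: 0, 1, I and O are left out for legibility.
    public const string CharacterChoices = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

    // Picks 'length' random characters from CharacterChoices.
    public static string RandomText(Random rnd, int length)
    {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = CharacterChoices[rnd.Next(0, CharacterChoices.Length)];
        return new string(chars);
    }

    // Left edge of each character's column, using the same integer
    // division as the RectangleF calculation in the drawing loop.
    public static int[] ColumnOffsets(int width, int count)
    {
        int columnWidth = width / count;
        int[] offsets = new int[count];
        for (int a = 0; a < count; a++)
            offsets[a] = columnWidth * a;
        return offsets;
    }
}
```

With the 600-pixel width and 10 characters used above, ColumnOffsets(600, 10) yields columns starting at 0, 60, 120 and so on up to 540.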
Lately, I've seen a lot of really bad SQL come through the office on interviews. In our extensive interview process, many of the other developers focus on simple SQL problems, which is really all that is necessary for the day-to-day job at Demand.
Unlike some of the other developers, I'm constantly chastised by my boss for worrying about security too much. I can't deny that I obsess about security, given my background, but because of that I'll occasionally ask the following question, which I think any SQL developer should be able to answer:
8. Given a simple login box (with username and password fields), what input will compromise the database in a susceptible system?
I'll even go so far as to show you the poorly written code that will allow this... (Note that the code looks up the password of the given user and checks it in the C# code below. That's all it takes to allow an exploit.)
string sql = string.Format(@"
    SELECT
        password
    FROM [dbo].[Accounts]
    WHERE username='{0}' ", Request.Form["username"]);
DataTable dt = new DataTable();
SqlConnection connection = new SqlConnection(connection_string);
SqlCommand command = new SqlCommand(sql, connection);
command.CommandType = CommandType.Text;
connection.Open();
SqlDataReader sdr = command.ExecuteReader(CommandBehavior.CloseConnection);
dt.Load(sdr);
sdr.Close(); //this will close the connection too
if (dt.Rows.Count > 0)
    if (dt.Rows[0]["password"].ToString() == Request.Form["password"])
        LoginUser();
First, the exploit. There are an infinite number of things you can do with a SQL injection, but we'll use this simple input in the username field:
' AND 0=1 UNION SELECT '123456' --
and 123456 in the password field.
This turns the executed query into:
SELECT
    password
FROM [dbo].[Accounts]
WHERE username='' AND 0=1 UNION SELECT '123456' --'
You'll note that the -- placed at the end comments out the rest of the original query, including the closing single-quote. The WHERE clause is therefore interpreted as username='' AND 0=1. Obviously, the AND 0=1 portion makes the entire clause evaluate to FALSE, so the first SELECT returns no rows. At this point, we UNION a literal '123456', which becomes the only row returned. It matches the password we typed, which allows us access to the site. (Note that this is a very simple example. In most cases the query would be selecting back the matching user account, and hence you could theoretically log in to any account.)
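To see the substitution happen without touching a database, the string.Format call can be run on its own. This is a small sketch with a hypothetical InjectionDemo class. The query is written on one line for brevity, but the placeholder and quoting are the same as in the login code above.

```csharp
using System;

public static class InjectionDemo
{
    // Builds the query text the same way the vulnerable login code does:
    // the user's input is pasted straight between the single quotes.
    public static string BuildQuery(string username)
    {
        return string.Format(
            "SELECT password FROM [dbo].[Accounts] WHERE username='{0}' ",
            username);
    }
}

// Usage:
//   InjectionDemo.BuildQuery("adam")
//     gives  ... WHERE username='adam'
//   InjectionDemo.BuildQuery("' AND 0=1 UNION SELECT '123456' --")
//     gives  ... WHERE username='' AND 0=1 UNION SELECT '123456' --'
```

The database never sees which part of the text came from the user, which is the whole problem.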
Some may argue that I made the impossible possible by revealing the original source code, but that's not necessarily true. For anyone attempting a SQL injection, it's most likely not a large leap to write a script to brute-force the parameters of the victim query. At that point, you can literally do whatever you want by using a little ingenuity and the INFORMATION_SCHEMA views, supported by most RDBMSs.
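For contrast, here is one common way to close this particular hole: pass the username as a parameter instead of formatting it into the SQL text. Treat it as a sketch rather than a drop-in fix. The SafeLogin class and BuildPasswordLookup method are hypothetical names, and the NVarChar type and length of 256 are assumptions about the username column.

```csharp
using System.Data;
using System.Data.SqlClient;

public static class SafeLogin
{
    // Builds the same lookup as the vulnerable code, but the username travels
    // as a parameter value, so it is never parsed as part of the SQL text.
    public static SqlCommand BuildPasswordLookup(string username)
    {
        SqlCommand command = new SqlCommand(
            "SELECT password FROM [dbo].[Accounts] WHERE username = @username");
        command.CommandType = CommandType.Text;
        // Assumed column type and size; adjust to match the real Accounts table.
        command.Parameters.Add("@username", SqlDbType.NVarChar, 256).Value = username;
        return command;
    }
}
```

The returned command still needs its Connection property set before it is executed in the same way as before. With the exploit string as input, the query simply looks for a user whose name is that literal text and comes back empty.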
.NET Quickies
* Using a method of the String object, what is the optimized .NET way of performing the (often executed) compound conditional:
if (some_string != null && some_string != "")
DoSomething();
String.IsNullOrEmpty(), negated to match the conditional above: if (!String.IsNullOrEmpty(some_string))
(in my tests for this blog entry, it consistently performed 40-45% faster)
* When encoding data, what is the key overall difference between hashing and encrypting?
Hashing is a one-way mapping, while encryption has a corresponding decryption which will reverse the process.
* What is the effect of making a method of a class static and what might its use be?
Static methods are not associated with any one instance of the class, nor are they able to access any instance fields of a class. Thus, rather than invoking the methods through an instance, you use the name of the class (since you are referencing the single Type object of that class maintained by .NET), like so:
string s = "some test string";
bool starts_with_some = s.StartsWith("some");
//StartsWith uses the instance s
bool is_null_or_empty = string.IsNullOrEmpty(s);
//IsNullOrEmpty is a static method
Static methods allow you to provide stand-alone methods that relate to a class's functionality. Another example might be a Country class. I might use it to represent a single country object, with fields/properties like CountryID, Name or ZipCodeList. I might also include methods that use the current instance's data, like GetIPRange() or FindContinent(). Finally, I could also add stand-alone (static) methods, like Country.GetAllCountries(), to return a List<string> containing the name of every country on Earth.
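A rough sketch of that Country class might look like the following. Only Country, CountryID, Name and GetAllCountries() come from the description above. The Describe() method is a hypothetical stand-in for an instance method, and the three hard-coded names are placeholder data standing in for a full list.

```csharp
using System.Collections.Generic;

public class Country
{
    // Placeholder data for illustration only; a real list would be loaded elsewhere.
    private static readonly string[] SampleNames = { "Canada", "Japan", "Kenya" };

    public int CountryID { get; private set; }
    public string Name { get; private set; }

    public Country(int countryId, string name)
    {
        CountryID = countryId;
        Name = name;
    }

    // Instance method: works with the data of one particular country.
    public string Describe()
    {
        return CountryID + ": " + Name;
    }

    // Static method: belongs to the class itself and needs no instance.
    public static List<string> GetAllCountries()
    {
        return new List<string>(SampleNames);
    }
}

// Usage:
//   Country c = new Country(1, "Canada");
//   string label = c.Describe();                    // called on an instance
//   List<string> all = Country.GetAllCountries();   // called on the class name
```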
I want to add the additional note that since I've been seeing an increase in the number of Google searches for "demand media" interview questions hitting my blog, we have been restructuring our interview process to change the questions around and are now working towards a much more hands-on interview. Note that part of the review process includes reviewing my blog for any questions and removing them (or limiting the use of them) from our interview process. So make sure you know how to use .NET in ways outside the scope of these questions.

I also want to encourage people to continue contacting me with your questions and comments. As long as there is an interest in the topic, I will continue to present real-life .NET interview questions.
Reflection on ASP.NET Auto-Compiled Classes
October 12, 2007 • 8:16AM • permalink
I came across a unique situation yesterday that took a while to figure out, but I thought it was a really cool concept!
The basic idea is that I have an ASP.NET website that references a DLL. The DLL contains an interface that other classes can implement, with the general idea of allowing external classes (external to the DLL) to act as "plug-ins". The logical location to place these classes is in the App_Code folder, since it will auto-compile the classes and make them available globally, but that's when I ran into a problem...
The DLL also contains a static class to populate a static collection of the classes, so that they can be referenced by name. Since the classes act as "plug-ins", they should be able to be modified at any time, as well as allow for new classes to be dropped into the App_Code folder. The only way to deal with a situation like this is with Reflection.
So, I included a reference to System.Reflection and tried loading the type information for one using Type.GetType(). That failed miserably as the return value was null. I thought for a minute and then wrapped the class placed in App_Code in a unique namespace. I went back to my call to Type.GetType() and tried referencing the class using this namespace. Again, the result was null, leading to a NullReferenceException.
How did I get around this issue? The solution is actually VERY VERY simple! You just need to get a reference to the Assembly that the App_Code folder gets compiled into by using a call to Assembly.Load("App_Code"). After that, you can use the returned assembly reference in order to get the class type. So, say I have a class named AdamWidget in my App_Code folder that implements the IWidget interface from my DLL. The code in the foreign DLL to load a Type instance for that class could be:
Assembly asm = Assembly.Load("App_Code");
Type module_type = asm.GetType("AdamWidget");
//GetType returns null if the class cannot be found
if (module_type != null && module_type.GetInterface("IWidget") != null)
{
    DoSomething();
}
That's all there is to it! Now our web application can import (through the DLL) any class that implements IWidget in our App_Code folder!
Please also note that I discovered later that you can also reference the same dynamic assembly with a call to Assembly.Load("__code").
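To pick up every plug-in rather than one known class name, the same idea can be extended to walk all of the types in the assembly. The sketch below rests on a few assumptions: the WidgetRegistry class and the Render() member of IWidget are hypothetical, and IWidget and AdamWidget are declared inline only to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Stand-ins for the interface in the DLL and a class in App_Code.
public interface IWidget
{
    string Render();
}

public class AdamWidget : IWidget
{
    public string Render() { return "Adam's widget"; }
}

public static class WidgetRegistry
{
    // Collects every concrete class in the assembly that implements IWidget,
    // keyed by class name so it can be looked up later.
    public static Dictionary<string, Type> FindWidgets(Assembly asm)
    {
        Dictionary<string, Type> widgets = new Dictionary<string, Type>();
        foreach (Type type in asm.GetTypes())
        {
            if (type.IsClass && !type.IsAbstract && typeof(IWidget).IsAssignableFrom(type))
                widgets[type.Name] = type;
        }
        return widgets;
    }

    // Creates an instance of a discovered widget type using its
    // parameterless constructor.
    public static IWidget Create(Type widgetType)
    {
        return (IWidget)Activator.CreateInstance(widgetType);
    }
}
```

In the website itself you would pass in the result of Assembly.Load("App_Code"). Outside of ASP.NET, typeof(AdamWidget).Assembly works for trying it out.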
Book Review: Programming Erlang
October 23, 2007 • 8:54PM • permalink
As you may know from reading my previous blog entries, I've recently been trying to mix up the books in my reading queue by exploring the benefits of a few new programming languages. Recently a friend of mine told me a few things about functional programming, concurrency and Erlang which inspired me to check it out.
Joe Armstrong, one of the creators of the Erlang language, has recently published a new book on the topic and it's truly a fascinating read.
One of the biggest complaints I hear from other developers about any introductory level book about a new programming language is the lack of useful programs that can be created upon completion of the book. What I mean is that you may be able to do the normal "Hello World", Fibonacci sequence output, etc. - but you probably won't be able to do anything really useful. This isn't the case with Armstrong's book.
The main idea of the book, and of Erlang in general (free for most environments at http://www.erlang.org), is concurrency, or the simultaneous execution of code. While C and its variants offer multi-threading, it is still essentially executing sequential code and hence subject to deadlocks, race-conditions, etc. Erlang enforces some strict rules from the get-go to support concurrency and allow for true simultaneous execution that is free from not only race-conditions and deadlocks, but from semaphores, mutexes and locks as well.
In order to do that, Erlang turns computer science on its head (from a sequential programming point-of-view), which Armstrong is quick to point out at every turn. For example, variables are only called as such because it makes things easier. The fact is they can't actually vary, and an exception is thrown if you even try.
On that same note, the equal sign is not the assignment operator like it usually is in programming; it is instead used to perform pattern matching. An example would probably work best to explain it:
1> X = 5.
5
2> X = 5.
5
3> X = 3.
=ERROR REPORT==== 16-Oct-2007::21:26:12 ===
Error in process <0.30.0> with exit value: {{badmatch,3},[{erl_eval,expr,3}]}
** exited: {{badmatch,3},[{erl_eval,expr,3}]} **
As I stated before, the = is not the assignment operator; it is instead used for pattern matching. There is an additional caveat, though: when used with an uninitialized variable (variable names start with a capital letter or an underscore), the variable is assigned that value. In line 1 above, we match the variable X against the literal value 5. Since X is uninitialized at the time we match it against the literal 5, it then takes on that value.
This is why, when we repeat the action in the second line, it returns the value 5 (indicating a match). Consequently, when we hit the third line, the shell throws an exception since we're matching X against 3, when it has already taken on the value 5.
You may wonder what the value of such an operator is, but when you dive into server programming, you'll see that it can be used (among other things) to direct functionality within network protocols. By matching against certain patterns, you can essentially code mini-conditional statements to perform various actions upon receipt of certain data. It seems complicated, but it's really not. Since this is a Book Review and not an Introduction to Erlang, I'll leave the explanation to Armstrong...
Within 75 pages, I had gained an enthusiasm for Erlang that was apparently infectious and it has been embraced by a few other developers in my Department (including Jon over at Rusty Razor Blade). While we're still not 100% sure it will be able to support the traffic load and perform fast enough, we're motivated enough to try. Armstrong makes it easy to learn from example too, since the book contains Erlang source to create a server for almost any major network protocol or project you could think of including IRC, SHOUTcast, a simple Error logger, a SQL Server and a Web Server.
We've discussed a plethora of different ideas we could run on our "new Erlang Framework", some more ambitious than others. We're convinced we could rewrite memcached in a few hundred lines of code, including all the features of the original, a few that are missing and commonly implemented in other cache systems, and a few of our own custom design. We figure if we can even get 80% of the throughput of our current memcached implementation then it will be worth our trouble, especially if it enables us to build an Erlang Framework to support any scalable idea we can come up with.
So if the idea of scalable server applications, functional programming or Erlang in general seems interesting, I highly suggest you check out Programming Erlang.
Joe Armstrong, one of the creators of the Erlang language, has recently published a new book on the topic and its truly a fascinating read.
One of the biggest complaints I hear from other developers about any introductory level book about a new programming language is the lack of useful programs that can be created upon completion of the book. What I mean is that you may be able to do the normal "Hello World", Fibonacci sequence output, etc. - but you probably won't be able to do anything really useful. This isn't the case with Armstrong's book.
The main ideas of the book and of Erlang (free for most environments at http://www.erlang.org) in general is concurrency or the simultaneous execution of code. While C and its variants offer multi-threading, it is still essentially executing sequential code and hence subject to deadlocks, race-conditions, etc. Erlang enforces some strict rules from the get-go to support concurrency and allow for true simultaneous execution that is free from not only race-conditions and deadlocks, but from semaphores, mutexes and locks as well.
In order to do that, Erlang turns computer science on its head (from a sequential programming point-of-view) which Armstrong is quick to point out at every turn. For example, variables are only called as such because it makes things easier. The fact is they can't actually vary and an Exception is thrown if you even try.
On that same note, the equal sign is not the assignment operator like it usually is in programming, it is instead used to perform pattern matching. An example would probably work best to explain it:
1>X = 5.
5
2>X = 5.
5
3>X = 3.
=ERROR REPORT==== 16-Oct-2007::21:26:12 ===
Error in process <0.30.0> with exit value: {{badmatch,3},[{erl_eval,expr,3}]}
** exited: {{badmatch,3},[{erl_eval,expr,3}]} **
As I stated before, the = is not the assignment operator, it is instead used for pattern matching. An additional caveat though, when used with an uninitialized variable (which usually start with Capital letters), the variable is assigned that value. In line 1 above, we match the variable X against the literal value 5. Since X is uninitialized at the time we match it against the literal 5, it then takes on that value.
This is why, when we repeat the action in the second line, it returns the value 5 (indicating a match). Consequently, when we hit the third line, the shell throws an exception since we're matching X against 3, when it has already taken on the value 5.
You may wonder what the value of such an operator is, but when you dive into server programming, you'll see that it can be used (among other things) to direct functionality within network protocols. By matching against certain patterns, you can essentially code mini-conditional statements to perform various actions upon receipt of certain data. It seems complicated, but it's really not - since this is a Book Review and not an Introduction to Erlang, so I'll leave the explanation to Armstrong...
Within 75 pages, I had gained an enthusiasm for Erlang that was apparently infectious and it has been embraced by a few other developers in my Department (including Jon over at Rusty Razor Blade). While we're still not 100% sure it will be able to support the traffic load and perform fast enough, we're motivated enough to try. Armstrong makes it easy to learn from example too, since the book contains Erlang source to create a server for almost any major network protocol or project you could think of including IRC, SHOUTcast, a simple Error logger, a SQL Server and a Web Server.
We've discussed a plethora of different ideas we could run on our "new Erlang Framework", some more ambitious than others. We're convinced we could rewrite memcached in a few hundred lines of code, including all the features of the original, a few that are missing but commonly implemented in other cache systems, and a few of our own custom design. We figure if we can even get 80% of the throughput of our current memcached implementation then it will be worth our trouble. Especially if it enables us to build an Erlang Framework to support any scalable idea we can come up with.
So if the idea of scalable server applications, functional programming or Erlang in general seems interesting, I highly suggest you check out Programming Erlang.
Basic Regular Expressions
September 30, 2007 • 3:13PM • permalink
Originally, I considered Regular Expressions to be a bonus skill. Something that was nice if developers had it, but not a necessity. Recently it seems that things would go a lot smoother for me if the people around me knew RegEx, but I've been surprised by how few people do. It might have something to do with there being only one decent book (that I've seen) on the subject.
So, I decided to write this small tutorial to give a basic description of RegEx. I will be using .NET for the examples, but the patterns themselves should be valid in most Regular Expression implementations.
Some of our examples below require a larger block of example text, so I'm going to use a paragraph of Lorem Ipsum text with some random punctuation thrown in:
string lipsum = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Duis non nulla id sapien molestie pulvinar. 'Nulla vitae risus vel quam imperdiet egestas!' Vestibulum fringilla consequat pede. Quisque tortor lectus, rhoncus ut, posuere vel, rhoncus in, tellus. Fusce mi. Curabitur eget augue sit amet lorem iaculis sagittis. Nam et massa. Nunc sagittis, libero et eleifend aliquet, mi sem varius orci, sit amet sagittis turpis est nec dolor? Nulla facilisi. Proin volutpat erat a sem. Maecenas nibh libero, euismod at, consequat quis, rutrum in, turpis. Aenean erat enim, fermentum a, luctus non, bibendum in, tellus. In ac libero. Suspendisse potenti. Pellentesque tincidunt dignissim mi.";
First, an overview of some of the simpler RegEx constructs:
Literal Text
It's simple. Literal text matches exactly. If text is not modified by Regular Expression constructs (punctuation), then it should be taken literally. So a search for the Regex pattern adam will look for my name in a block of text.
[] - Character Classes and Ranges
A character class matches exactly one character out of the set listed between the brackets. For example, if I want to match all the vowels in the above text I can use:
[aeiou]
This will match exactly one character (a vowel) in any block of text that it is matched against. I'll use this pattern to illustrate the simplest way to collect all matches of a regular expression.
//this is needed at the very top of the file
using System.Text.RegularExpressions;
RegexOptions options = RegexOptions.IgnorePatternWhitespace;
options |= RegexOptions.IgnoreCase;
Regex regex = new Regex("[aeiou]", options);
MatchCollection mc = regex.Matches(lipsum);
(Please note that to conserve space, I won't repeat the calls to add the RegexOptions, but you can assume that IgnorePatternWhitespace and RegexOptions.IgnoreCase were used in each example.)
After the snippet is run, the MatchCollection object, mc, holds all the vowel matches in the Lorem ipsum text (o, e, i, u, o, o, i and so on). Note that without the RegexOptions.IgnoreCase option, our pattern would need to be [AEIOUaeiou] in order to match the uppercase letters as well.
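To actually look at what ended up in the MatchCollection, you can loop over it. Here's a minimal sketch; the VowelDemo class and FindVowels method are names I made up for illustration, and the short input is just a stand-in for the full lipsum string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class VowelDemo
{
    // Hypothetical helper: returns every vowel found, in order of appearance.
    public static List<string> FindVowels(string input)
    {
        Regex regex = new Regex("[aeiou]", RegexOptions.IgnoreCase);
        List<string> found = new List<string>();
        foreach (Match m in regex.Matches(input))
        {
            found.Add(m.Value); // Value holds the text of this match
        }
        return found;
    }

    static void Main()
    {
        // Shortened stand-in for the lipsum text
        Console.WriteLine(string.Join(", ", FindVowels("Lorem ipsum"))); // o, e, i, u
    }
}
```

Each Match also exposes an Index property, which tells you where in the input the match was found.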
Character classes can also be used to contain ranges. To match against any letter in the alphabet, the following character class can be used:
[A-Za-z]
When matched against the lipsum text, this will match once for each single letter. Some other example character classes are:
[0-9] Any single digit
[a-ep-z] Any letter between 'a' and 'e' (inclusive) or between 'p' and 'z' (inclusive)
[A-Za-z0-9] Any letter or number
Note that the above will literally match one single character each time. If we need to match more characters, we can use additional aspects of Regex to indicate that.
First, we will look at the * modifier. This indicates that the preceding match component should be matched zero or more times. (The zero is important because it means that empty strings will match as well.) For example:
The regular expression [a-z]* will match all of the following:
a
d
adam
lorem
abcdefghijklmnopqrstuvwxyz
We can change the * into a + to match the preceding component one or more times. That is the difference between the * and the +. The * can have 0 matches and still satisfy the regular expression, while the + requires at least one physical match to be considered a valid match. For example, in the lipsum text above, the pattern [0-9]* would match, but [0-9]+ wouldn't, since the former is matched by empty strings. It is also important to note that, by default, regular expressions are greedy and try to match as many characters as possible. Both the * and the + will include as many characters in their matches as they can.
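The difference between * and +, and the greedy behaviour, can be sketched in a few lines. QuantifierDemo and its two helper methods are names I invented for this example. The lazy +? quantifier on the last line isn't covered in this tutorial; I've included it only as a contrast to the default greedy matching.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // True if the pattern matches anywhere in the input (an empty match counts).
    public static bool HasMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    // Text of the first match; an empty string if the match is empty or absent.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        string text = "no digits here";
        Console.WriteLine(HasMatch(text, "[0-9]*")); // True: * accepts an empty match
        Console.WriteLine(HasMatch(text, "[0-9]+")); // False: + needs at least one digit
        Console.WriteLine(FirstMatch("lorem ipsum", "[a-z]+"));  // lorem (greedy)
        Console.WriteLine(FirstMatch("lorem ipsum", "[a-z]+?")); // l (lazy)
    }
}
```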
Additionally, we can also use the ? which will match zero or one of the preceding pattern, essentially making it an "optional match". So:
L?orem?
will find a match in any of the following (in Bored, only the ore in the middle is matched):
Lorem
Bored
ore
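If you want to see exactly which part of each word the optional pattern grabs, a quick sketch like this will show it. OptionalDemo and FirstMatch are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionalDemo
{
    // Returns the part of the input matched by L?orem? (empty if no match).
    public static string FirstMatch(string input)
    {
        return Regex.Match(input, "L?orem?").Value;
    }

    static void Main()
    {
        foreach (string word in new string[] { "Lorem", "Bored", "ore" })
        {
            // Prints: Lorem -> Lorem, Bored -> ore, ore -> ore
            Console.WriteLine(word + " -> " + FirstMatch(word));
        }
    }
}
```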
Additional Modifiers
^ (Caret)
This modifier can be used in two very different ways. The first way is at the beginning of the inside of a character class. If present, the caret negates the meaning of the character class and instead matches ANY character except those inside of the brackets.
If we match the above lipsum text against the pattern [^sjdhflo ]+ (note that it includes the space) we are matching against one or more characters in a row that are not s, j, d, h, f, l, o or space.
If we actually match the above pattern we see many results (239 in all), such as rem, ip, um and so on.
^ (Caret) Part 2
The ^ can also be used outside of character classes, usually at the very beginning of a RegEx pattern. When present, it anchors the pattern to the start of the text. By default in .NET this is only the start of the string; with RegexOptions.Multiline it also matches after each hard line break. So, in the above lipsum text, the pattern ^lorem would only return one match, even though the word appears in the text twice. (Note that the above does not have any hard line breaks, only soft line breaks caused by the formatting of the webpage, so the ^ will only match at the beginning of the entire block of text either way.)
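Here's a small sketch of how the line-start anchor behaves in .NET with and without the RegexOptions.Multiline option. The AnchorDemo class, the CountAtLineStart method and the two-line sample text are my own inventions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Counts how many times "lorem" appears at a position where ^ can match.
    public static int CountAtLineStart(string input, RegexOptions options)
    {
        return Regex.Matches(input, "^lorem", options).Count;
    }

    static void Main()
    {
        // Two lines separated by a hard line break
        string text = "lorem ipsum\nlorem dolor";
        Console.WriteLine(CountAtLineStart(text, RegexOptions.None));      // 1
        Console.WriteLine(CountAtLineStart(text, RegexOptions.Multiline)); // 2
    }
}
```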
$
The $ is normally used at the end of a RegEx pattern, as the opposite of the ^ shown above. The $ indicates that the pattern is anchored to the end of the text (or, with RegexOptions.Multiline, to the end of each line). See the following example:
string s = "Sally sells seashells by the sea";
Regex regex = new Regex("sea$");
MatchCollection mc = regex.Matches(s);
In the above code, mc will only contain one match. This will be the one at the end, since it has the $. If we remove the $, it will instead have two matches. Like so:
string s = "Sally sells seashells by the sea";
Regex regex = new Regex("sea");
MatchCollection mc = regex.Matches(s);
Note that if we change the original string to have proper punctuation at the end (the period) the original sea$ pattern will not match at all.
string s = "Sally sells seashells by the sea.";
Regex regex = new Regex("sea$");
MatchCollection mc = regex.Matches(s);
//mc.Count is now 0!!!
This is because the pattern is looking for the word sea at the very end of the block of text. Because the period is there and we're not looking for it, our pattern won't match!
There are a number of ways to fix this problem, but a general punctuation match will do the trick.
string s = "Sally sells seashells by the sea.";
Regex regex = new Regex("sea[!?.,-]$", options);
MatchCollection mc = regex.Matches(s);
Note the placement of the hyphen (-) at the end of the character class so that it isn't accidentally misinterpreted as a range of characters.
Note also the caveat that although the period outside of the character class WOULD match, it wouldn't do exactly what you think. Can you guess what the following example will print?
string s = "Sally sells seashells by the seaX";
Regex regex = new Regex("sea.$", options);
MatchCollection mc = regex.Matches(s);
if (mc.Count == 0)
Response.Write("0 matches!");
else if (mc.Count == 1)
Response.Write("1 match!");
If you guessed "1 match!", you're correct - but do you know why? The answer lies in our next modifier.
. (Period)
The . is used to represent ANY character (except a newline, by default), but at least one character must be in the matching position. It's really just a placeholder to say that "something" has to be in the position indicated.
In the example given above, the X fills the position that the . is in, so we get our "1 match!".
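If what you really want is a literal period at the end, the period has to be escaped with a backslash (escaping is covered in the next section). A minimal sketch of the difference, using a hypothetical DotDemo class and shortened versions of the Sally sentence:

```csharp
using System;
using System.Text.RegularExpressions;

static class DotDemo
{
    // True if the pattern matches somewhere in the input.
    public static bool HasMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        Console.WriteLine(HasMatch("by the seaX", "sea.$"));   // True: . accepts the X
        Console.WriteLine(HasMatch("by the seaX", @"sea\.$")); // False: \. needs a real period
        Console.WriteLine(HasMatch("by the sea.", @"sea\.$")); // True
    }
}
```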
Escaped Characters and Shortcuts
The \ backslash is an extremely powerful character in RegEx pattern matching with a plethora of uses. First, it is used to escape special characters to be taken as literals within Regular Expression patterns. For example, if we were to search for text within parentheses, we would need to escape the parentheses, since they are used within RegEx patterns to form groups, as I'll explain below.
To search for the word Trunks in parentheses the pattern would be:
\(Trunks\)
The following example shows a common problem for beginners using RegEx in .NET:
string s = "This sentence is about my cat (Trunks).";
Regex regex = new Regex("\\(Trunks\\)");
MatchCollection mc = regex.Matches(s);
Note that in the above code I used two backslashes before each parenthesis. Can you figure out why? Here's an alternative way of writing it which might give you a hint:
Regex regex = new Regex(@"\(Trunks\)");
In .NET, as well as many other languages, characters within strings can be escaped by the backslash character. \n is the newline, \t is the tab, \r is the carriage return and so on. The .NET string that you're using for the RegEx pattern is first interpreted by the C# compiler, so any escape sequences will already have been processed by the time the RegEx parser takes control, and the pattern will not match correctly.
In order to get around that problem, we first escape the backslash itself, by using a \\ construct. As mentioned above (and below) the parenthesis is a special character in RegEx and needs to be escaped if you intend to use it as a literal.
In the pattern \\(Trunks\\), the backslash is first escaped as a .NET string so that the RegEx parser actually sees \(Trunks\). This escapes the parentheses in the RegEx parsing and correctly finds them in the text. As an additional note, if you're unclear as to why the @"\(Trunks\)" works, it is a C# verbatim string. In a verbatim string, backslashes are taken literally rather than treated as escape characters (the only special case is the double-quote character, which you represent by doubling up "") and you can include literal tabs and newlines.
Please remember, this is for our .NET example and won't apply to all languages. If you're not using .NET, check with your language reference to see if you need to escape the backslashes.
RegEx Escaped Characters
RegEx has its own escape strings that are mostly used as shortcuts for large ranges of characters. Like many other languages, you escape RegEx characters by using a backslash.
Note that in .NET you need to take the above caveat into consideration when using these. For example, to search for a literal backslash, you would need to escape it in both .NET and RegEx: \\\\. This is because the C# compiler first reduces it to \\, then RegEx reduces that to \ and searches for the literal value.
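The double escaping is easier to believe when you see both spellings side by side. This is only a sketch; BackslashDemo, CountBackslashes and the sample path are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackslashDemo
{
    // Counts the matches of the given pattern in the input.
    public static int CountBackslashes(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    static void Main()
    {
        string path = @"C:\temp\file.txt"; // contains two literal backslashes

        // Both literals produce the same two-character pattern: \\
        Console.WriteLine("\\\\" == @"\\");                  // True
        Console.WriteLine(CountBackslashes(path, "\\\\"));  // 2
        Console.WriteLine(CountBackslashes(path, @"\\"));   // 2
    }
}
```

In practice the verbatim string is usually the easier one to read, since what you type is what the RegEx parser sees.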
Here are a few examples:
\d The same thing as [0-9].
\w The same thing as [a-zA-Z_0-9].
\s This will match against any whitespace.
The capital versions offer negations...
\D The same thing as [^0-9].
\W The same thing as [^a-zA-Z_0-9].
\S This will match against any character EXCEPT whitespace.
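There are many other, less used, escaped characters. Be sure to check your RegEx implementation's documentation for the full list. (In .NET, \d and \w also match non-ASCII Unicode digits and letters unless RegexOptions.ECMAScript is specified, so the bracketed equivalents above are exact only for ASCII text.)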
Note again that the backslash should be double-escaped when necessary.
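Here's a short sketch of the shortcuts in action, written with verbatim strings so the backslashes only need to be typed once. ShortcutDemo, FindAll and the sample sentence are my own inventions for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ShortcutDemo
{
    // Returns the text of every match of the pattern in the input.
    public static List<string> FindAll(string input, string pattern)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    static void Main()
    {
        string input = "Room 101, floor 3";
        Console.WriteLine(string.Join("|", FindAll(input, @"\d+"))); // 101|3
        Console.WriteLine(string.Join("|", FindAll(input, @"\w+"))); // Room|101|floor|3
        Console.WriteLine(FindAll(input, @"\s").Count);              // 3
        Console.WriteLine(FindAll(input, @"\D+").Count);             // 2
    }
}
```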
Grouping
In the teaching of Regular Expressions, Grouping is usually considered an advanced topic and not taught in a first lesson. Personally, I don't see the point of learning how to use RegEx unless you can use it!
A group is automatically formed every time a matching pair of parentheses is used in a pattern. Each match has an automatic group for the entire match, followed by one group for each pair of parentheses, as shown in the following example:
string s = "Mississippi";
Regex regex = new Regex("([aeiou][s]+)");
MatchCollection mc = regex.Matches(s);
The MatchCollection.Count property would be 2, indicating that the pattern matched twice (on iss both times, since we're matching any vowel followed by the letter s one or more times). If we examine the MatchCollection, we see it is made of Match objects, with a Groups property, another collection.
In all matches, the first Group (mc[0].Groups[0]) always contains the entire match, in this case iss. Actually, in this case, the second Group (mc[0].Groups[1]) also contains iss. This is because we're grouping the entire match and the result of our match was iss. You would see the exact same results in the second Match object since iss appears twice in the word Mississippi. Hence both mc[1].Groups[0] and mc[1].Groups[1] would contain the string iss.
If we change the parentheses slightly and only group the vowel:
string s = "Mississippi";
Regex regex = new Regex("([aeiou])[s]+");
MatchCollection mc = regex.Matches(s);
The MatchCollection.Count property would still be 2 with the same exact matches. Also, Groups[0] would still contain iss, since that's our entire match. However, Groups[1].Value would now contain only the i, since that's the entire match inside our parentheses.
We'll change it slightly one more time:
string s = "Mississippi";
Regex regex = new Regex("([aeiou])(s)s");
MatchCollection mc = regex.Matches(s);
Our pattern is trying to match a vowel followed by two letter s characters. We are grouping the vowel by itself and additionally grouping the first of two s characters.
When we run the above code we get two matches, like we expect to. Each one matches against iss. If we examine the Groups collection, we see that Groups[1] contains the i and Groups[2] contains the first s.
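To check the groups for yourself, something along these lines will dump each match together with its groups. GroupDemo and Describe are hypothetical names; the pattern is the same one used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // One line per match: "index: whole match | group 1 | group 2"
    public static List<string> Describe(string input)
    {
        List<string> lines = new List<string>();
        foreach (Match m in Regex.Matches(input, "([aeiou])(s)s"))
        {
            lines.Add(m.Index + ": " + m.Groups[0].Value
                + " | " + m.Groups[1].Value
                + " | " + m.Groups[2].Value);
        }
        return lines;
    }

    static void Main()
    {
        // Prints:
        // 1: iss | i | s
        // 4: iss | i | s
        foreach (string line in Describe("Mississippi"))
        {
            Console.WriteLine(line);
        }
    }
}
```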
The Match object (like each Group inside it) also exposes Index and Value properties, which are helpful if you want to modify text in the original string that you've located using a Regular Expression.
In the below example, I'm going to very simply replace every instance of the pattern r[aeiou]+m with the word BLAH. The pattern will match one or more vowels between the letters r and m. (If you're unclear as to why, re-read the sections above!)
string l = lipsum;
Regex regex = new Regex("r[aeiou]+m");
//start by searching the working copy
Match m = regex.Match(l);
while (m.Success)
{
//splice BLAH in place of the matched text
l = l.Substring(0, m.Index) + "BLAH" + l.Substring(m.Index + m.Value.Length);
//search again in the updated string
m = regex.Match(l);
}
We keep looping as long as a match is found and use the Index property of the Match object to determine where in the original string our match was found. We remove the match and replace it with BLAH.
The resulting string is:
LoBLAH ipsum dolor sit amet, consectetuer adipiscing elit. Duis non nulla id sapien molestie pulvinar. 'Nulla vitae risus vel quam imperdiet egestas!' Vestibulum fringilla consequat pede. Quisque tortor lectus, rhoncus ut, posuere vel, rhoncus in, tellus. Fusce mi. Curabitur eget augue sit amet loBLAH iaculis sagittis. Nam et massa. Nunc sagittis, libero et eleifend aliquet, mi sem varius orci, sit amet sagittis turpis est nec dolor? Nulla facilisi. Proin volutpat erat a sem. Maecenas nibh libero, euismod at, consequat quis, rutBLAH in, turpis. Aenean erat enim, fermentum a, luctus non, bibendum in, tellus. In ac libero. Suspendisse potenti. Pellentesque tincidunt dignissim mi.
Which you can compare with the original, here:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Duis non nulla id sapien molestie pulvinar. 'Nulla vitae risus vel quam imperdiet egestas!' Vestibulum fringilla consequat pede. Quisque tortor lectus, rhoncus ut, posuere vel, rhoncus in, tellus. Fusce mi. Curabitur eget augue sit amet lorem iaculis sagittis. Nam et massa. Nunc sagittis, libero et eleifend aliquet, mi sem varius orci, sit amet sagittis turpis est nec dolor? Nulla facilisi. Proin volutpat erat a sem. Maecenas nibh libero, euismod at, consequat quis, rutrum in, turpis. Aenean erat enim, fermentum a, luctus non, bibendum in, tellus. In ac libero. Suspendisse potenti. Pellentesque tincidunt dignissim mi.
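For a straight substitution like this one, .NET also has a built-in Regex.Replace method that does the same job in a single call. A minimal sketch, assuming a shortened stand-in for the lipsum text (ReplaceDemo and ReplaceAll are names I made up):

```csharp
using System;
using System.Text.RegularExpressions;

static class ReplaceDemo
{
    // Replaces every match of r[aeiou]+m with BLAH in one pass.
    public static string ReplaceAll(string input)
    {
        return Regex.Replace(input, "r[aeiou]+m", "BLAH", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        // Shortened stand-in for the lipsum text
        string sample = "Lorem ipsum dolor sit amet lorem, rutrum in";
        Console.WriteLine(ReplaceAll(sample)); // LoBLAH ipsum dolor sit amet loBLAH, rutBLAH in
    }
}
```

One difference from the manual loop: Regex.Replace walks the input once and never rescans the text it has just inserted, so it can't loop forever if the replacement happens to match the pattern.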
That's quite a lot to take in for one entry on RegEx and should definitely get any novices a giant step closer to Regular Expression mastery! Look for an advanced discussion (including some .NET specific RegEx constructs) in the future.
Javascript and ASP.NET Hacks
September 12, 2007 • 7:25AM • permalink
Both ASP.NET and Javascript can be extremely useful, in entirely different ways. ASP.NET is a great server-side environment and Javascript can be used to enhance the client-side experience. In my experience, junior developers often have a difficult time getting the two to play nicely together, so I thought I would share a few common tricks. (Please note that these tricks apply whether you code in Visual Basic or C#. Many of them, or something very similar, are also trivial to implement in other server-side languages, such as PHP, Python or JSP.)
1) Data Injection (ASP.NET => Javascript)
First, in the code-behind area of the page, we set up a simple string variable from an external source:
private string username;
public string Username
{
    get { return username; }
    set { username = value; }
}
public void Page_Load(object sender, EventArgs e)
{
    username = Request["username"];
}
Then, in the front-end part of the page, we can use this variable for a Javascript injection:
<head>
    <script type="text/javascript">
        alert('<% =Username %>');
    </script>
</head>
The result is that when the page is rendered, the value in username is injected into the Javascript. So if the value Adam is passed into the page, the Javascript is transformed at runtime to:
<head>
    <script type="text/javascript">
        alert('Adam');
    </script>
</head>
When the page loads, an alert box appears with the requested username.
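One caveat: the value is written into the script verbatim, so a username containing a quote or a closing script tag would break the script, or let someone inject their own. The value should be escaped on the server before it is injected. Below is a minimal sketch of the kind of escaping involved, written in Javascript purely to illustrate the idea. Both escapeForJsString and renderAlert are hypothetical helpers of mine, not ASP.NET APIs, and the list of escaped characters is not exhaustive:

```javascript
// Hypothetical helper: escape a value so it can sit inside a quoted
// Javascript string literal in an inline script block.
function escapeForJsString(value) {
  return String(value)
    .replace(/\\/g, "\\\\")   // backslashes first, so later escapes survive
    .replace(/'/g, "\\'")
    .replace(/"/g, '\\"')
    .replace(/\r/g, "\\r")
    .replace(/\n/g, "\\n")
    .replace(/</g, "\\x3C");  // stops "</script>" from closing the block
}

// Simulates the text the server-side <% =Username %> substitution
// would produce if the value were escaped first.
function renderAlert(username) {
  return "alert('" + escapeForJsString(username) + "');";
}

console.log(renderAlert("Adam"));     // prints alert('Adam');
console.log(renderAlert("O'Brien"));  // prints alert('O\'Brien');
```

On the ASP.NET side you would do this escaping in the code-behind before the value reaches the page. Newer versions of the .NET Framework include HttpUtility.JavaScriptStringEncode for this purpose.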

2) Input Injection (Javascript => ASP.NET)
There are many ways to get data from an HTML form to the ASP.NET code, including the basic query string and the basic form post. Sometimes, though, a server control isn't flexible enough to collect user input the way you need. In the following example, we're going to collect data from multiple checkboxes and pass them back as a comma-delimited string.
Before I begin, you may ask why I wouldn't just use a CheckBoxList or a series of individual CheckBox controls. You could (especially after reading the third trick below), but that would make it a little more difficult to add dynamic behavior to the checkboxes, such as Select All or Select None.
First, the back-end code:
private string received_values;
public string ReceivedValues
{
    get { return received_values; }
    set { received_values = value; }
}
public void Page_Load(object sender, EventArgs e)
{
    received_values = Request.Form["sent_value"];
}
All we're doing is making the result of an HTTP form post, with the input name "sent_value", publicly accessible.
Then, on the front-end, we're going to create our checkboxes based on the contents of a static array. This is only to keep the example simple. Our "IDValues" could represent anything from friends on a buddy list, stocks in a portfolio, books in a library or software titles in a shopping cart: anything you could retrieve from a database, XML feed, etc.
Here's the entire code listing. The explanation is below it.
<form id="the_form" method="post">
    <% int[] IDValues = new int[] { 5, 10, 25, 50 }; %>
    <% for (int x = 0, cnt = IDValues.Length; x < cnt; ++x) { %>
        <input type="checkbox" id="someid_<% =IDValues[x] %>" /> <% =IDValues[x] %>
    <% } %>
    <input type="hidden" id="sent_value" name="sent_value" value="" />
    <input type="button" onclick="doSubmit(); return false;" value="Submit" />
</form>
<% if (!string.IsNullOrEmpty(ReceivedValues)) { %>
    <%-- HtmlEncode the posted value before echoing it back to the page --%>
    <strong>Received Values:</strong> <% =Server.HtmlEncode(ReceivedValues) %>
<% } %>
<script type="text/javascript">
function doSubmit()
{
    var frm = document.getElementById('the_form');
    var post_string = "";
    // Collect the numeric part of each checked element's id.
    for (var x = 0, cnt = frm.elements.length; x < cnt; ++x)
    {
        if (frm.elements[x].checked)
            post_string += frm.elements[x].id.substring(7) + ",";
    }
    // Drop the trailing comma, if there is one.
    if (post_string.length > 0)
        post_string = post_string.substring(0, post_string.length - 1);
    // Copy the list into the hidden field so it is posted with the form.
    document.getElementById('sent_value').value = post_string;
    frm.submit();
}
</script>
It's actually very simple! We create four checkboxes, easily identified by the prefix 'someid_' in their id property. When the button is clicked, we obtain a reference to the form object and loop through all of its elements. If the item is checked (which, in our example, means it must be a checkbox), we remove the 'someid_' prefix and append the rest of the id to a running string, followed by a comma delimiter.
After traversing the whole form, we clean up the string by removing the extraneous trailing comma and store the value in the hidden input tag we created earlier. This is the key to posting the resulting values to the back-end.
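The collection step, along with the Select All / Select None idea mentioned earlier, can be sketched as two small functions. This is only one possible arrangement, under my own assumptions: the function names are hypothetical, and I've added a check on the element type so that non-checkbox inputs are skipped explicitly.

```javascript
// Build the comma-delimited list from any array-like of form elements.
// Each element only needs id, type and checked, so plain objects work too.
function collectCheckedIds(elements, prefix) {
  const ids = [];
  for (let i = 0; i < elements.length; i++) {
    const el = elements[i];
    if (el.type === "checkbox" && el.checked && el.id.indexOf(prefix) === 0) {
      ids.push(el.id.substring(prefix.length));
    }
  }
  return ids.join(",");  // join() never leaves a trailing comma
}

// Select All / Select None: put every matching checkbox in the same state.
function setAllChecked(elements, prefix, checked) {
  for (let i = 0; i < elements.length; i++) {
    const el = elements[i];
    if (el.type === "checkbox" && el.id.indexOf(prefix) === 0) {
      el.checked = checked;
    }
  }
}
```

Inside doSubmit, the loop and the comma cleanup could then be replaced with a single call such as collectCheckedIds(frm.elements, 'someid_'), and a Select All button would call setAllChecked(frm.elements, 'someid_', true).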
Upon submission, the ReceivedValues string is populated and written back to the page. For example, checking 5 and 25 displays "Received Values: 5,25".

3) Javascript on ASP.NET Controls (Javascript <=> ASP.NET)
Finally, there are a few additional tricks you can mix in to ease the integration of Javascript and ASP.NET. A very simple example would be a form that requires both client-side validation and server-side validation.
I'll assume the reader can already output an error message using either Javascript or ASP.NET. In this example, we'll assume that we have an existing system that validates a page and, upon error, sets the InnerHtml property of a div object with the ID 'ErrorMessage'. (Note that a div marked runat="server" is exposed as an HtmlGenericControl object on the back-end.)
If we decided to add Javascript validation as well (possibly to implement a 'strong password' indicator like the one on Live.com), we wouldn't want to have to create a new location for Javascript error messages.
Using a simple trick, we don't have to:
<form runat="server">
    <div id="ErrorMessage" runat="server"></div>
    <script type="text/javascript">
        document.getElementById("<% =ErrorMessage.ClientID %>").innerHTML = "Cool, eh?";
    </script>
</form>
That's all there is to it! While this example is oversimplified, it is easy to see how it can be implemented and extended. The same goes for all the examples given above. With the plethora of ASP.NET server controls and Javascript methods available, not to mention AJAX implementations, you can make your sites much more dynamic by using these tricks.
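As an illustration of the 'strong password' indicator idea, here is a minimal sketch that writes its result into the same shared element. The scoring rules, labels and function names are my own assumptions, chosen only to keep the example short. They are not a recommendation for real password policy.

```javascript
// Hypothetical scoring rules: one point per satisfied condition.
function passwordStrengthMessage(password) {
  let score = 0;
  if (password.length >= 8) score++;
  if (/[a-z]/.test(password) && /[A-Z]/.test(password)) score++;
  if (/[0-9]/.test(password)) score++;
  if (/[^A-Za-z0-9]/.test(password)) score++;
  const labels = ["Very weak", "Weak", "Fair", "Good", "Strong"];
  return "Password strength: " + labels[score];
}

// Writes the message into the shared element. The element can be anything
// with an innerHTML property, such as the div looked up via ClientID.
function showStrength(element, password) {
  element.innerHTML = passwordStrengthMessage(password);
}
```

On the page, the element passed to showStrength would be the one returned by document.getElementById("<% =ErrorMessage.ClientID %>"), so the client-side message lands in the same div that the server-side validation already uses.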
