

Image Mixer Java Tool

Image Mixer

I have recently developed a simple Java GUI-based tool that combines multiple images into one. It lets you select the images to be mixed and the intensity at which each image is blended in.

You can download the source code from BitBucket – ImageMixer and an executable JAR package from ImageMixer.jar.

Image Histogram Java Tool

I have recently developed a simple Java GUI-based tool which plots a histogram of a JPG image. The tool implements the histogram computation and plotting manually and also applies a simple histogram smoothing. The following types of histograms are computed (a short sketch of the per-pixel computation follows the list):

  • Luminosity – the histogram for the luminosity of the pixels is computed.
  • Grayscale 1 – a red-green-blue color pixel is converted into a representative gray pixel using one popular formula and the grayscale histogram is computed.
  • Grayscale 2 – a red-green-blue color pixel is converted into a representative gray pixel using a second popular formula and the grayscale histogram is computed.
  • Red – the histogram for the red color is computed.
  • Green – the histogram for the green color is computed.
  • Blue – the histogram for the blue color is computed.
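
As an illustration of the per-pixel computation (a sketch of my own in C++, not the tool's Java code), the snippet below builds a red-channel histogram and two grayscale histograms. The two grayscale formulas assumed here are the widely used ITU-R BT.601 and BT.709 luma weights – the tool may use different ones.

#include <cstdio>

int main()
{
    // A tiny stand-in "image": 4 pixels, 3 bytes (R, G, B) each.
    unsigned char pixels[] = { 255, 0, 0,   0, 255, 0,   0, 0, 255,   128, 128, 128 };
    const int numPixels = 4;

    int histGray1[256] = {0};   // grayscale via BT.601 weights
    int histGray2[256] = {0};   // grayscale via BT.709 weights
    int histRed[256]   = {0};   // plain red-channel histogram

    for (int p = 0; p < numPixels; p++) {
        unsigned char r = pixels[3 * p + 0];
        unsigned char g = pixels[3 * p + 1];
        unsigned char b = pixels[3 * p + 2];

        // Two popular RGB-to-gray conversions (assumed; the tool may use others).
        int gray1 = (int)(0.299  * r + 0.587  * g + 0.114  * b);  // ITU-R BT.601
        int gray2 = (int)(0.2126 * r + 0.7152 * g + 0.0722 * b);  // ITU-R BT.709

        histGray1[gray1]++;
        histGray2[gray2]++;
        histRed[r]++;
    }

    printf("pixels with red == 255: %d\n", histRed[255]);
    return 0;
}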

You can download the source code (an Eclipse project) from ImageHistogram.zip and a JAR package from ImageHistogram.jar.

Visual Basic script – Applying a formula across a column

This VB script computes a formula across a column and writes the formula into a range of cells in another column of an Excel file. It can be very useful for computing the average / min / max of groups of numbers. An example is shown below:

RAW                     Average
Index   Value           Index   Value (AVG)
1       21.95           1       22.73
1       21.21           2       21.82
1       25.02           3       20.91
2       23.66
2       21.19
2       20.63
3       23.31
3       19.59
3       19.83

To apply the script, first select the cell where the first formula should be written and then run it.

Sub ComputeAndWriteFormula(formulaName As String, sheetName As String, column As String, startRow As Integer, endRow As Integer, stepSize As Integer)
' This procedure writes one formula per group of rows, starting from the currently selected cell.
' Each formula is built in the following way:
' =FormulaName(SheetName!Column<row>:Column<row + stepSize - 1>)

    Dim sheet As String
    Dim formula As String
    Dim ind As Integer

    If Len(sheetName) = 0 Then
        sheet = ""
    Else
        sheet = sheetName + "!"
    End If

    For ind = startRow To endRow Step stepSize
        Dim startCell As String
        Dim endCell As String
        Dim endCellRow As Integer
        endCellRow = ind + stepSize - 1
        startCell = column + Trim(Str(ind))
        endCell = column + Trim(Str(endCellRow))
        formula = "=" + formulaName + "(" + sheet + startCell + ":" + endCell + ")"
        ActiveCell.formula = formula
        ActiveCell.Offset(1, 0).Select
    Next
End Sub

Sub TestComputeAndWriteFormula()
    Call ComputeAndWriteFormula("AVERAGE", "raw", "E", 2, 313, 3)

End Sub
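
With the example call above, the script starts from the currently selected cell and writes =AVERAGE(raw!E2:E4), then =AVERAGE(raw!E5:E7) one cell below it, and so on until row 313 of column E is covered.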

Number of vertices (nodes) in a Tree

The number of all vertices in a complete tree with branching factor "b" and depth "d" (i.e., d levels, counting the root level) is

V = 1 + b + b^2 + … + b^(d-1) = (b^d – 1)/(b – 1)

The number of edges is E = V -1
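
For example, a binary tree (b = 2) with d = 3 levels has V = 1 + 2 + 4 = (2^3 – 1)/(2 – 1) = 7 vertices and E = 6 edges.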

Profiling with VTune

I have just seen a tutorial about using VTune to profile an application. The tutorial covers the different aspects that should be analyzed to understand the performance of an application – including cache misses, thread parallelism, cache line sharing, and cycles per instruction (CPI). The tutorial is available at "Using Intel® VTune™ Performance Analyzer Events/ Ratios & Optimizing Applications".

Translating Instruction Address to Source Line

In this article I will describe how to programmatically translate an instruction address from a Win32 executable to a source line. The described approach uses the Microsoft debugger engine dbgeng.dll, so you should have it installed. If you don’t have it, you can download and install the Windows debugger WinDbg, which will install dbgeng.dll.

Requirements

To use this approach you need the Windows debugger engine – dbgeng.dll – installed. If you don’t have it, install the WinDbg debugger, which will install dbgeng.dll.

Approach

Many debuggers support debugging at the source code level. A debugger provides source-level debugging by translating instruction addresses to lines in source code files. Our approach uses the same mechanism that debuggers use; this particular solution translates an instruction address to a source line using functionality provided by the Windows debugger engine dbgeng.dll.

In overview, we create a process from the Win32 executable whose instruction addresses we want to translate and attach the debugger engine to it; both steps are done with the CreateProcessAndAttach function. When the debugger is attached, we call the GetLineByOffset function to do the actual translation. After we finish with the translation task we terminate the process with the TerminateCurrentProcess function. Besides these calls there are a few more things that need to be done, which I explain in the implementation.

Implementation

First we create objects that implement the following COM interfaces: IDebugControl, IDebugClient5, and IDebugSymbols. To create these objects we use the DebugCreate function.

static IDebugClient5 *dbgClient5 = NULL;
static IDebugSymbols *dbgSymbols = NULL;
static IDebugControl *dbgControl = NULL;
...
DebugCreate(__uuidof(IDebugControl), (void **)&dbgControl);
DebugCreate(__uuidof(IDebugClient5), (void **)&dbgClient5);
DebugCreate(__uuidof(IDebugSymbols), (void **)&dbgSymbols);

Before creating the process from the executable, we set the debugger engine filters so that the target process breaks into the debugger immediately after it is created. To set the debugger filters we use the SetSpecificFilterParameters function.

DEBUG_SPECIFIC_FILTER_PARAMETERS filter[10];
for (int i = 0; i < 10; i++) {
	filter[i].ExecutionOption = DEBUG_FILTER_BREAK;
	filter[i].ContinueOption = DEBUG_FILTER_GO_HANDLED;
	filter[i].TextSize = 0;
	filter[i].CommandSize = 0;
	filter[i].ArgumentSize = 0;
}

dbgControl->SetSpecificFilterParameters(0, 10, filter);

Using the CreateProcessAndAttach function we create a Windows process under the debugger and attach the debugger to it.

ULONG64 server            = 0;
PSTR    commandLine       = executableName; 
ULONG   processId         = 0; 
ULONG   attachFlags       = 0; 

dbgClient5->CreateProcessAndAttach(
	server, 
	commandLine,
	DEBUG_PROCESS, 
	processId, 	
	attachFlags); 
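
One detail that is easy to miss: the attach does not complete until the engine has processed the initial debug events. In typical dbgeng usage this is done by calling IDebugControl::WaitForEvent; something along these lines (not shown in the snippets here) is needed before symbols of the target can be resolved:

// Pump debug events until the initial break-in has been processed;
// until then the target's modules and symbols are not available.
dbgControl->WaitForEvent(DEBUG_WAIT_DEFAULT, INFINITE);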

To translate the instruction address to a source line we call the GetLineByOffset function.

ULONG64 offset = instrAddress;
ULONG fileNameBufferSize = MAX_FILE_NAME_SIZE;
memset(sourceInfo->fileName, 0, fileNameBufferSize);	
sourceInfo->lineNo = 0;
sourceInfo->fileSize = 0;
sourceInfo->displacement = 0;
		
HRESULT isOk = dbgSymbols->GetLineByOffset(
	offset,
	&sourceInfo->lineNo,
	sourceInfo->fileName,
	fileNameBufferSize,
	&sourceInfo->fileSize,
	&sourceInfo->displacement);
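
The sourceInfo object used above is a small helper structure of the project whose definition is not shown here; a minimal, hypothetical definition consistent with the GetLineByOffset signature would look roughly like this:

// Hypothetical helper struct matching the fields used above; the
// downloadable project may define it differently.
#define MAX_FILE_NAME_SIZE 1024

struct SourceInfo {
    char    fileName[MAX_FILE_NAME_SIZE]; // source file containing the instruction
    ULONG   lineNo;                       // line number within that file
    ULONG   fileSize;                     // size needed for the full file name
    ULONG64 displacement;                 // offset of the address from the start of the line
};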

If we want to translate multiple addresses, we simply call GetLineByOffset multiple times without repeating the earlier steps. After we finish with the translation we detach the debugger from the process by terminating the process with the TerminateCurrentProcess function.

HRESULT isOk = dbgClient5->TerminateCurrentProcess();

Download

The complete source code is available for download from InstructionToSourceLine.zip. Note that the instruction address provided as input should be a decimal number, not a hexadecimal one.

C# Version

There is also a C# implementation of this tool that you can download from InstructionToSourceLineCSharp.zip. This C# implementation consists of two parts – DbgEngManaged and InstructionToSourceLineCSharp. DbgEngManaged is a managed C++ wrapper library around the debugger engine DbgEng.dll. InstructionToSourceLineCSharp is a C# console application that references the DbgEngManaged library. Note that the error handling here is not done properly, so don't judge me for it :) The code provided here is something that I quickly prototyped and decided to share for those who, for some reason, need to use the debugger engine. It can also serve as an example of using native and managed code together.

Uses of const in C++

To explain to a friend the different uses and semantics of "const" in C++, I came across one very nicely written article, "The C++ ‘const’ Declaration: Why & How". It concludes with an example that has all the possible uses of "const" in one statement, which I copy & paste here.

Of course one sometimes needs to combine some of these different uses of ‘const’ which can get confusing as in

const int*const Method3(const int*const&)const;

where the 5 uses ‘const’ respectively mean that the variable pointed to by the returned pointer & the returned pointer itself won’t be alterable and that the method does not alter the variable pointed to by the given pointer, the given pointer itself & the object of which it is a method!
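
To make the example a bit more concrete, here is a small compilable sketch of my own (not from the cited article) that puts the declaration into a class so each ‘const’ can be labeled:

class Example
{
public:
    //  (1)       (2)            (3)       (4)      (5)
    // const int *const Method3(const int *const &) const;
    //
    // (1) the int pointed to by the returned pointer cannot be modified
    // (2) the returned pointer itself is const
    // (3) the int pointed to by the given pointer cannot be modified
    // (4) the given pointer (passed by reference) cannot be modified
    // (5) the method does not modify the Example object it is called on
    const int* const Method3(const int* const& p) const
    {
        return p; // allowed: we only read through p
    }
};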

A Short and Comprehensive Tutorial on Windows API

I needed to write a simple native Win32 Windows application and after searching around I found a very nice tutorial – easy to read and with rich examples. So here is the link: "Windows API Tutorial". The tutorial includes examples of how to write native (Win32) applications with forms, drawing graphics, and threading. The tutorial follows the good practice of encapsulating the Win32 abstractions in C++ classes (something which is already done in MFC), which makes things easier to understand.

Cache Efficient Matrix Multiplication

In this article I will show a cache-efficient algorithm for matrix multiplication and the impact of caches on the performance of algorithms. Matrix multiplication is a very simple algorithm with Θ(n³) complexity (both upper and lower bound). If you are not familiar with the algorithm, I would suggest first reading the Matrix multiplication article on Wikipedia. A CPU cache is a small, very fast storage whose physical location is very close to the CPU. CPU caches make our programs run faster in three different ways:

  1. Instructions that are executed repeatedly are cached, so the program runs faster because they are fetched from the cache rather than from slow memory – this is the job of a specialized instruction cache and is not our focus here, since we will not deal with instruction-level optimizations.
  2. Variables that are accessed repeatedly in the code are cached, and subsequent accesses are faster – this is known as temporal locality.
  3. Variables that are located next to each other in memory, such as array elements, are accessed faster: when the first array element is accessed for the first time, that element together with the next few array elements is brought into the cache – this is known as spatial locality.

For more information about how caches implement the behavior above, you can refer again to the Wikipedia entry CPU Cache or to a very nice article by Jon Stokes titled Understanding CPU caching and performance.

The Matrix Multiplication Algorithms

Now, after this brief overview, let's return to our matrix multiplication algorithm. Here I have two algorithms:

  1. The well-known STANDARD algorithm for matrix multiplication that most of us are taught in high school.
  2. An OPTIMIZED algorithm, which takes advantage of the spatial locality of the array elements (the third way caches make programs run faster, listed above) by traversing the matrices' elements in a different order than the STANDARD algorithm.

After multiplying matrices A (of size m×n) and B (of size n×k), each entry c(i,j) in the resulting matrix C (of size m×k) is:

c(i,j) = a(i,0)·b(0,j) + a(i,1)·b(1,j) + … + a(i,n−1)·b(n−1,j), where 0 ≤ i < m and 0 ≤ j < k.

The Standard Algorithm

Below is the standard algorithm for multiplying two square matrices. It follows the formula above, computing each element of matrix C completely before moving to the next one, so matrix C is traversed only once.

 

 1: for (k = 0; k < n; k++)
 2: {
 3:     for (i = 0; i < n; i++)
 4:     {
 5:         for (j = 0; j < n; j++)
 6:         {
 7:             c[k][i] = c[k][i] + a[k][j]*b[j][i];
 8:         }
 9:     }
10: }

 

The Optimized Algorithm

Below is the optimized algorithm for multiplying two square matrices. This algorithm computes the elements of matrix C incrementally, traversing the whole of matrix C once for every value of k. The two algorithms differ only at line 7: the optimized algorithm traverses matrix B along its rows (across consecutive elements), thus benefiting from spatial locality, and it traverses matrix C multiple times, thus benefiting from temporal locality in the L2 cache.

 1: for (k = 0; k < n; k++)
 2: {
 3:     for (i = 0; i < n; i++)
 4:     {
 5:         for (j = 0; j < n; j++)
 6:         {
 7:             c[i][j] = c[i][j] + a[i][k]*b[k][j];
 8:         }
 9:     }
10: }
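
If you want to try this yourself without downloading the project below, here is a minimal self-contained C++ sketch (my own, not the downloadable source) that runs both loop orders on square matrices and times them:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Standard order: the inner loop walks B down a column (poor spatial locality).
static void multiplyStandard(const std::vector<double>& a, const std::vector<double>& b,
                             std::vector<double>& c, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[k * n + i] += a[k * n + j] * b[j * n + i];
}

// Optimized order: the inner loop walks B and C along a row (good spatial locality).
static void multiplyOptimized(const std::vector<double>& a, const std::vector<double>& b,
                              std::vector<double>& c, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

int main()
{
    const int n = 1024; // adjust to taste; bigger matrices show a bigger gap
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    multiplyStandard(a, b, c, n);
    auto t1 = std::chrono::steady_clock::now();

    std::fill(c.begin(), c.end(), 0.0);
    multiplyOptimized(a, b, c, n);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("standard:  %.2f s\n", std::chrono::duration<double>(t1 - t0).count());
    std::printf("optimized: %.2f s\n", std::chrono::duration<double>(t2 - t1).count());
    return 0;
}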

 

Performance Results

The table below shows the performance of the two algorithms for different matrix sizes. For simplicity the matrices are square. The table also shows the cache miss rates obtained with the VTune profiling tool. The machine I ran the application on has a 1.66 GHz dual-core CPU with a 2 MB L2 cache per core and 32 KB L1 instruction and data caches. I compiled the source with the Microsoft Visual C++ compiler. The L1 miss rate is per retired instruction and the L2 miss rate is per data reference.

                        Standard                      Optimized
Size        Memory      Time    L1 Miss   L2 Miss     Time    L1 Miss   L2 Miss
128×128     64K         0.1     5.5%      0%          0.09    0.7%      0%
256×256     256K        0.23    7.3%      0%          0.2     0.5%      0%
512×512     1MB         1.5     7.1%      0.1%        1.5     0.5%      0.2%
1024×1024   4MB         44.2    7.2%      12.5%       7.9     0.5%      0.6%
2048×2048   16MB        394.7   10.1%     12.5%       63.5    0.5%      0.6%

Cache efficient matrix multiplication performance

Why Is the Optimized Algorithm Faster?

Now let's try to explain why the optimized algorithm runs faster. From the table above we see that for small matrices the performance penalty is not evident. The reason is that the matrices are small enough to fit almost entirely in the L1 cache and entirely in the L2 cache.

The first major reason for the performance penalty of the standard algorithm on big matrices comes from the L1 cache: the standard algorithm traverses matrix B down its columns and therefore does not benefit from spatial locality in the L1 cache. The second penalty comes from the L2 cache. Again, the standard algorithm does not benefit from spatial locality, because it walks matrix B down its columns and every access to an uncached row brings a new cache line in (this holds for both L1 and L2). Moreover, the standard algorithm does not benefit from temporal locality, as it accesses each element of matrix C only once (not repeatedly); although the element currently being computed (c[k][i]) is kept in a register, that alone does not make up for the missed cache benefits.

The optimized algorithm traverses matrix A less optimally than the standard algorithm does, but the performance gained from the other optimizations makes this negligible.

Conclusion

From this small example we can see that, besides the theoretical complexity studied and analyzed in "Introduction to Algorithms", the delivered performance of an algorithm also depends on the micro-architectural details of our computers. To write performance-efficient programs we should know the underlying architectural details. This algorithm, for example, could be further tuned to the cache line size, but that would make it dependent on a particular micro-architecture (CPU) and thus potentially slower on other processors.

Downloads

The source code of the matrix multiplication example is available here: MatrixMuliplay.

Uninitialized Variable May Waste a Whole Day

Today I was writing a program and implemented a method that parses the contents of a raw byte buffer into different primitive data types – including 32-bit addresses, which are stored in 64-bit unsigned integers.

Because I had two class member fields with similar names, I missed initializing one of them to 0. Then, when I copied the 32-bit address from the buffer into the 64-bit integer variable, my program was crashing like crazy and I didn’t have a clue that it could be because of the uninitialized field. I was checking whether the 64-bit variable was equal to the expected value (as in the code demonstrated below), but haaahh, it was not equal. And again, no idea why?!? So, inspecting the code carefully and inserting stupid printfs around (OK, I was not able to debug my application with a debugger), I finally found the nasty uninitialized field at the end of the day, at 01:20 after midnight.

So here comes the lesson: uninitialized variables are very dangerous and, besides wasting your time, may cause irreversible problems. It looks like I should be thankful that my libraries were crashing rather than hiding the bug.
Below I attach a code segment that reproduces the error I had today.

To compile: g++ -g -o UninitializedFieldCpp UninitializedFieldCpp.cpp

//-------------------------------------------------------------------------
//
// How I wasted a whole day because of an uninitialized member field in C++.
//
//-------------------------------------------------------------------------

//#include "stdafx.h"

#include <iostream>
#include <cstring>

using namespace std;

typedef unsigned long long ULONG64;
typedef unsigned int ULONG;
typedef unsigned char byte;

//
// Here we have 16 bytes
//
byte buffer[] = {0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
                 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00};

class MyClass
{
public:
    MyClass() : _initializedField(0) {}

    void copyNumberUninitializedField()
    {
        //
        // Here we get a pointer to the first FF byte in the buffer
        //
        byte* pointerToFF = buffer + 4;

        memcpy(&_uninitializedField, pointerToFF, sizeof(ULONG));

        if (_uninitializedField == 0xffffffff)
        {
            cout << "_uninitializedField: there is no problem" << endl;
        }
        else
        {
            cout << "ERROR: " << endl;
            cout << "_uninitializedField must be equal to 0xffffffff, but was: ";
            cout << hex << _uninitializedField << endl;
        }
    }

    void copyNumberInitializedField()
    {
        //
        // Here we get a pointer to the first FF byte in the buffer
        //
        byte* pointerToFF = buffer + 4;

        memcpy(&_initializedField, pointerToFF, sizeof(ULONG));

        if (_initializedField == 0xffffffff)
        {
            cout << "_initializedField: there is no problem" << endl;
        }
        else
        {
            cout << "ERROR: " << endl;
            cout << "_initializedField must be equal to 0xffffffff, but was: ";
            cout << hex << _initializedField << endl;
        }
    }

private:
    //
    // This is our uninitialized field.
    //
    ULONG64 _uninitializedField;

    //
    // This is our good initialized field
    //
    ULONG64 _initializedField;
};

int main(int argc, char* argv[])
{
    MyClass m = MyClass();

    m.copyNumberUninitializedField();

    cout << endl << endl;

    m.copyNumberInitializedField();

    cout << "Done" << endl;

    return 0;
}
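
The fix, of course, is a one-line change – initialize both fields in the constructor's initializer list (in declaration order):

// Initialize both fields so the upper 32 bits are zero before
// memcpy fills only the lower 32 bits.
MyClass() : _uninitializedField(0), _initializedField(0) {}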