
Distributed Data Processing, Using DataSynapse as an Example

Nowadays we hear a lot of talk about Big Data and its processing. Everyone has heard about the Map/Reduce concept and distributed storage systems such as Hadoop, Cassandra, MongoDB, etc.

These tools make it possible to keep a lot of persisted data (distributed across different servers) and periodically perform parallel calculations on that data...
Calculations that must be mapped and then reduced (e.g. a tricky "select" query against distributed data).
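
To make the "mapped and then reduced" idea more concrete, here is a minimal sketch in plain Java (no Hadoop involved); the Report class and its fields are hypothetical, used only for illustration:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MapReduceSketch {

    // Hypothetical report type, used only to illustrate the idea.
    static class Report {
        String region;
        double amount;
    }

    // "Map" every report to its (region, amount) pair, then "reduce" by summing
    // per region - the same shape of work a Hadoop job does, just on one machine.
    static Map<String, Double> totalByRegion(List<Report> reports) {
        return reports.stream()
                .collect(Collectors.groupingBy(
                        r -> r.region,
                        Collectors.summingDouble(r -> r.amount)));
    }
}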

But what if we don't want to store our input data? What if our task can't or shouldn't be "MapReduced"?

In this case we can (and will) use so-called DataGrid systems:

These systems allow us to unite many computers into a single grid and use it for all kinds of distributed calculations.

Let's take a look at a typical task for grid systems. Say we have a lot of data (e.g. financial reports) that should be processed by some algorithm. All these reports are available through some shared resource (NFS, a database, a cache). We also have a pack of computers to perform the calculation. And, of course, we can easily implement the algorithm for processing the reports! There is only one problem: how do we write the code that lets us run our processing algorithm on all of those computers?
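
Before any grid enters the picture, the single-machine version of this task might look like the sketch below; ReportStore, Report and process() are hypothetical names I use only to frame the problem:

import java.util.List;

class SingleMachineProcessor {

    // Hypothetical accessors for the shared resource (NFS, database, cache).
    interface Report { }
    interface ReportStore {
        List<Report> loadReports(Object criteria);
        void storeResult(Object result);
    }

    private final ReportStore store;

    SingleMachineProcessor(ReportStore store) {
        this.store = store;
    }

    // Processes every report sequentially on one machine - exactly the loop
    // we would like to spread across all available computers.
    void processAll(Object criteria) {
        for (Report report : store.loadReports(criteria)) {
            Object result = process(report);  // our easily-implemented algorithm
            store.storeResult(result);
        }
    }

    private Object process(Report report) {
        // the actual financial-report calculation goes here
        return report;
    }
}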

And here is where DataGrid systems enter the game! Let's consider Tibco DataSynapse as our DataGrid system. I won't describe the installation and configuration processes, since they are already covered in the official documentation. All I want to show you is how easy it is to write a distributed application that can be used in DataSynapse.

With DataSynapse, the code that implements our financial report processing algorithm looks like this:

public class CalculationService {

    public void init() {
        // init sources (connections to the shared resource, caches, etc.)
    }

    private Object readInputData(Object criteria) {
        // read the input data selected by the criteria
        return null;
    }

    private void storeOutputData(Object result) {
        // store the calculation result
    }

    public void doCalc(Object criteria) {
        Object inputData = readInputData(criteria);

        // Do some calculation with inputData
        Object result = inputData;

        storeOutputData(result);
    }
}

To make this logic work inside a DataSynapse instance we need... nothing! Just pack it into a jar, add a simple descriptor, upload it to the DataSynapse instances through the DataSynapse Manager, specify the available methods and start your calculations! All these steps are described in detail in the administrator and developer guides.

After you have deployed your application through the DataSynapse Manager, it becomes available to run on all of your hardware nodes. Through a dedicated web interface you'll be able to start or terminate jobs and see how many resources (nodes, processors, etc.) each one consumes.

You can access your application through a proprietary TCP-based protocol or over HTTP (SOAP).

To start calculations you need to write a simple client:

// Look up the deployed service by name and create a handler for the results.
Service s = ServiceFactory.getInstance().createService("CalculationService");
CalcServiceHandler handler = new CalcServiceHandler();

// Submit an invocation of doCalc with our criteria; results are delivered to the handler.
final Object[] arguments = new Object[] { new Criteria() };
s.submit("doCalc", arguments, handler);
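
For completeness, CalcServiceHandler is our own class that receives results as invocations complete. Below is a rough sketch of what it might look like; the callback interface name and method signatures here are my assumptions, not the verbatim DataSynapse driver API, so check the developer guide for the exact contract:

// Sketch only: interface name and signatures are assumed, not taken from the
// DataSynapse documentation - consult the Java driver guide for the real ones.
public class CalcServiceHandler implements ServiceInvocationHandler {

    public void handleResponse(Object result, int invocationId) {
        // called when a doCalc invocation finishes successfully on the grid
        System.out.println("Invocation " + invocationId + " finished: " + result);
    }

    public void handleError(Exception error, int invocationId) {
        // called when a doCalc invocation fails on the grid
        error.printStackTrace();
    }
}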

Just do not forget to set the DS properties in the client's JVM environment (DS*Director, DSUsername, etc. - see the documentation).
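
One way to do that is via JVM system properties, either with -D flags on the command line or programmatically before creating the service; the values below are placeholders, and the exact set of property names should be taken from the documentation:

// Illustrative placeholders only: take the exact property names
// (the DS*Director address, DSUsername, and the rest) from the documentation.
System.setProperty("DSUsername", "grid-user");
// System.setProperty("DS...Director", "http://director-host:8000"); // exact name per the docs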
