DZone


Saurabh Chhajed

Senior Big Data Engineer at Impetus Technologies

Scottsdale, US

Joined May 2014

https://saurzcode.in

About

Saurabh Chhajed is a Senior Big Data Engineer who enjoys exploring new technologies and putting them to good use. He is a Cloudera Certified Hadoop Developer and has worked on many enterprise products for some of the largest banks and organizations in the US. He is a firm believer in open source and its power to drive the ongoing technology revolution. He writes articles on Java technologies, the web, open source, and the big data world.

Stats

Reputation: 157
Pageviews: 1.0M
Articles: 9
Comments: 14
  • Articles
  • Comments

Articles

Write a Kafka Producer Using Twitter Stream
With the newly open-sourced Twitter HBC, a Java HTTP library for consuming Twitter's Streaming API, we can easily create a Kafka Twitter stream producer.
Updated October 5, 2020
· 38,534 Views · 13 Likes
What Is RDD in Spark and Why Do We Need It?
Spark has already overtaken Hadoop (MapReduce) in general because of the benefits it provides in terms of faster execution of iterative processing algorithms.
October 26, 2015
· 58,993 Views · 7 Likes
How-To: Setup Development Environment for Hadoop MapReduce
This post is intended for folks who are looking for a quick start on developing a basic Hadoop MapReduce application. We will see how to set up a basic MR application for WordCount using Java, Maven, and Eclipse, and run a basic MR program in local mode, which is easy to debug at an early stage. This assumes JDK 1.6+ is already installed, Eclipse is set up with the Maven plugin, and downloads from the default Maven repository are not restricted.

Problem Statement: Count the occurrences of each word appearing in an input file using MapReduce.

Step 1: Adding the Dependency

Create a Maven project in Eclipse and use the following in your pom.xml:

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.saurzcode.hadoop</groupId>
  <artifactId>MapReduce</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.2.0</version>
    </dependency>
  </dependencies>
</project>

Upon saving, it should download all the dependencies required for running a basic Hadoop MapReduce program.

Step 2: Mapper Program

The map step involves tokenizing the file, traversing the words, and emitting a count of one for each word found. Our mapper class should extend the Mapper class and override its map method. When this method is called, the value parameter contains a chunk of the lines of the file to be processed, and the context parameter is used to emit word instances. In a real-world clustered setup, this code runs on multiple nodes, whose output is consumed by a set of reducers for further processing.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}

Step 3: Reducer Program

Our reducer extends the Reducer class and implements the logic to sum up each occurrence of a word token received from the mappers. Output from the reducers goes to the output folder as a text file named part-r-00000 (by default, or as configured in the driver program's output format), along with a _SUCCESS file.

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text text, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(text, new IntWritable(sum));
    }
}

Step 4: Driver Program

Our driver program configures the job by supplying the map and reduce programs we just wrote, along with various input and output parameters.
public class WordCount {

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        // Create configuration
        Configuration conf = new Configuration(true);

        // Create job
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCountMapper.class);

        // Setup MapReduce
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input
        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Delete output if exists
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(outputDir))
            hdfs.delete(outputDir, true);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }
}

That's it! We are all set to execute our first MapReduce program in Eclipse in local mode. Let's assume there is an input text file called input.txt in a folder named input, which contains the following text:

foo bar is foo count count foo for saurzcode

Expected output:

foo 3
bar 1
is 1
count 2
for 1
saurzcode 1

Let's run this program in Eclipse as a Java application. We need to pass the paths of the input file and the output folder to the program as arguments. Also note that the output folder shouldn't exist before running this program, or the program will fail.

java com.saurzcode.mapreduce.WordCount input/inputfile.txt output

If this program runs successfully, emitting a set of lines while it executes the mappers and reducers, we should see an output folder with the following files:

output/
  _SUCCESS
  part-r-00000
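Outside of Hadoop, the counting logic that the mapper and reducer implement can be sketched in plain Java. This is a hypothetical helper (WordCountSketch is not part of the article's code) that mirrors the map step (emit a count of one per token) and the reduce step (sum the counts per word) in a single pass:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {

    // Mirrors map (emit 1 per token) and reduce (sum per word) in one pass
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                count(List.of("foo bar is foo count count foo for saurzcode"));
        System.out.println(counts.get("foo"));   // 3
        System.out.println(counts.get("count")); // 2
    }
}
```

Running this on the sample line above reproduces the expected WordCount output; in Hadoop the same logic is simply split across the cluster, with the shuffle phase grouping values per key between map and reduce.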
January 30, 2015
· 12,447 Views · 1 Like
How to Configure MySQL Metastore for Hive?
This is a step-by-step guide on how to configure a MySQL metastore for Hive in place of the default Derby metastore.
January 8, 2015
· 80,927 Views · 4 Likes
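The teaser above summarizes the switch from Derby to MySQL. As a rough sketch (the connection URL, user, and password values here are illustrative placeholders, not from the article), the change boils down to pointing the standard metastore properties in hive-site.xml at a MySQL JDBC connection instead of the embedded Derby one:

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```

The MySQL JDBC connector JAR must also be on Hive's classpath for the driver class above to load.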
How to configure Swagger to generate Restful API Doc for your Spring Boot Web Application
Learn How to Enable Swagger in your Spring Boot Web Application
August 26, 2014
· 127,838 Views · 3 Likes
How to Setup Realtime Analytics over Logs with ELK Stack
Once we know something, we find it hard to imagine what it was like not to know it. - Chip & Dan Heath, Authors of Made to Stick, Switch

Update: I have recently published a book on the ELK stack titled Learning ELK Stack; more details can be found here.

What Is the ELK Stack?

The ELK stack is ElasticSearch, Logstash, and Kibana. Together, these three provide a fully working real-time data analytics tool for extracting valuable information from your data.

ElasticSearch

ElasticSearch, built on top of Apache Lucene, is a search engine focused on real-time analysis of data, based on a RESTful architecture. It provides standard full-text search functionality and powerful query-based search. ElasticSearch is document-oriented: you can store everything you want as JSON. This makes it powerful, simple, and flexible.

Logstash

Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use. In the ELK stack, Logstash plays the important role of shipping the logs and indexing them so they can be supplied to ElasticSearch.

Kibana

Kibana is a user-friendly way to view, search, and visualize your log data. It presents the data stored from Logstash in ElasticSearch in a highly customizable interface, with histograms and other panels that provide real-time analysis and search of the data you have parsed into ElasticSearch.

How Do I Get It?

http://www.elasticsearch.org/overview/elkdownloads/

How Do They Work Together?

Logstash is essentially a pipelining tool. In a basic, centralized installation, a Logstash agent known as the shipper reads input from one or more input sources and outputs that text, wrapped in a JSON message, to a broker. Typically the broker, Redis, caches the messages until another Logstash agent, known as the collector, picks them up and sends them to another output. In the common case this output is Elasticsearch, where the messages are indexed and stored for searching.

The Elasticsearch store is accessed via the Kibana web application, which allows you to visualize and search through the logs. The entire system is scalable: many different shippers may be running on many different hosts, watching log files and shipping the messages off to a cluster of brokers, and many collectors can be reading those messages and writing them to an Elasticsearch cluster.

(E)lasticSearch (L)ogstash (K)ibana (The ELK Stack)

How Do I Fetch Useful Information Out of Logs?

Fetching useful information from logs is one of the most important parts of this stack, and it is done in Logstash using its grok filters and a set of input, filter, and output plugins. These let you take various kinds of input (file, tcp, udp, gemfire, stdin, unix, web sockets, even IRC and Twitter, and many more), filter them (grok, grep, date filters, etc.), and finally write output to ElasticSearch, Redis, email, HTTP, MongoDB, Gemfire, Jira, Google Cloud Storage, etc.

A Bit More About Logstash Filters

Transforming the logs as they go through the pipeline is possible as well, using filters on either the shipper or the collector, whichever suits your needs better. As an example, an Apache HTTP log entry can have each element (request, response code, response size, etc.) parsed out into individual fields so they can be searched more seamlessly. Information can be dropped if it isn't important. Sensitive data can be masked. Messages can be tagged. The list goes on. For example:

input {
  file {
    path => ["var/log/apache.log"]
    type => "saurzcode_apache_logs"
  }
}

filter {
  grok {
    match => ["message", "%{COMBINEDAPACHELOG}"]
  }
}

output {
  stdout {}
}

The above example takes input from an Apache log file, applies a grok filter with %{COMBINEDAPACHELOG} (which indexes the Apache log information into fields), and finally writes the output to the standard output console.

Writing Grok Filters

Writing grok filters and fetching information is the only task that requires some serious effort, and if done properly it will give you great insights into your data, like the number of transactions performed over time or which types of products have the most hits. The links below will help you a lot in writing grok filters and testing them with ease:

Grok Debugger
http://grokdebug.herokuapp.com/

Grok Patterns Lookup
https://github.com/elasticsearch/logstash/tree/v1.4.2/patterns

References

http://www.elasticsearch.org/overview/
http://logstash.net/
http://rashidkpc.github.io/Kibana/about.html
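As a further sketch of the grok step (the log line and field names here are illustrative assumptions, not from the article), a filter for a simple timestamped application log might combine the stock TIMESTAMP_ISO8601, LOGLEVEL, and GREEDYDATA patterns:

```
filter {
  grok {
    match => ["message", "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"]
  }
}
```

This would split a line such as `2014-08-26T10:15:00 ERROR disk full` into timestamp, level, and msg fields, which Kibana can then filter and aggregate on.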
August 26, 2014
· 47,209 Views · 4 Likes
Top 10 Hadoop Shell Commands to Manage HDFS
So you already know what Hadoop is, why it is used, and what problems you can solve with it?
June 30, 2014
· 484,553 Views · 8 Likes
SOAP Webservices Using Apache CXF: Adding Custom Object as Header in Outgoing Requests
What is CXF? Apache CXF is an open source services framework. CXF helps you build and develop services using frontend programming APIs, like JAX-WS and JAX-RS. These services can speak a variety of protocols, such as SOAP, XML/HTTP, RESTful HTTP, or CORBA, and work over a variety of transports, such as HTTP, JMS, etc.

How Does CXF Work?

As you can see here and here, where the processing of CXF service calls is described, most of the functionality in the Apache CXF runtime is implemented by interceptors. Every endpoint created by the Apache CXF runtime has potential interceptor chains for processing messages. The interceptors in these chains are responsible for transforming messages between the raw data transported across the wire and the Java objects handled by the endpoint's implementation code.

Interceptors in CXF

When a CXF client invokes a CXF server, there is an outgoing interceptor chain for the client and an incoming chain for the server. When the server sends the response back to the client, there is an outgoing chain for the server and an incoming one for the client. Additionally, in the case of SOAP faults, a CXF web service will create a separate outbound error-handling chain and the client will create an inbound error-handling chain. The interceptors are organized into phases to ensure that processing happens in the proper order. The various phases involved in the interceptor chains are listed in the CXF documentation here.

Adding your custom interceptor involves extending one of the abstract interceptor classes that CXF provides and supplying the phase in which that interceptor should be invoked.

AbstractPhaseInterceptor: This abstract class provides implementations for the phase-management methods of the PhaseInterceptor interface, as well as a default implementation of the handleFault() method. Developers need to provide an implementation of the handleMessage() method; they can also provide a different implementation for the handleFault() method. The developer-provided implementations can manipulate the message data using the methods provided by the generic org.apache.cxf.message.Message interface.

For applications that work with SOAP messages, Apache CXF provides an AbstractSoapInterceptor class. Extending this class gives the handleMessage() and handleFault() methods access to the message data as an org.apache.cxf.binding.soap.SoapMessage object. SoapMessage objects have methods for retrieving the SOAP headers, the SOAP envelope, and other SOAP metadata from the message.

The piece of code below shows how we can add a custom object as a header to an outgoing request.

Interceptor:

public class SoapHeaderInterceptor extends AbstractSoapInterceptor {

    public SoapHeaderInterceptor() {
        super(Phase.POST_LOGICAL);
    }

    @Override
    public void handleMessage(SoapMessage message) throws Fault {
        List<Header> headers = message.getHeaders();
        TestHeader testHeader = new TestHeader();
        JAXBElement<TestHeader> testHeaders = new ObjectFactory().createTestHeader(testHeader);
        try {
            Header header = new Header(testHeaders.getName(), testHeader,
                    new JAXBDataBinding(TestHeader.class));
            headers.add(header);
            message.put(Header.HEADER_LIST, headers);
        } catch (JAXBException e) {
            e.printStackTrace();
        }
    }
}
May 29, 2014
· 14,677 Views · 1 Like
String Interning — What, Why, and When?
Learn about string interning, a method of storing only one copy of each distinct string value, which must be immutable.
May 23, 2014
· 145,424 Views · 13 Likes
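To make the interning teaser concrete, here is a minimal sketch in plain Java (the class name and literal values are illustrative, not from the article):

```java
public class InternDemo {

    // intern() returns the canonical, pooled copy of a string's value
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String heapCopy = new String("saurzcode"); // distinct object on the heap
        String literal = "saurzcode";              // pooled at class-load time

        System.out.println(heapCopy == literal);          // false: different objects
        System.out.println(heapCopy.equals(literal));     // true: same characters
        System.out.println(heapCopy.intern() == literal); // true: pooled copy is shared
    }
}
```

Because interned strings share one immutable pooled copy, reference comparisons (==) on interned strings are cheap, which is the performance angle the article discusses.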

Comments

Agile Development - Part 2

Sep 09, 2014 · Amit Mehra

Thanks for sharing a great article, Lieven!

How to Setup Realtime Analytics over Logs with ELK Stack

Aug 27, 2014 · Saurabh Chhajed

The basic framework remains ELK, but to streamline the messages we can use any broker, like Redis/RabbitMQ, etc.; that's up to us.


How to configure Swagger to generate Restful API Doc for your Spring Boot Web Application

Aug 26, 2014 · Saurabh Chhajed

You have to separate out SwaggerConfig so that it is not loaded during tests. A simple solution I have worked out is to use @Profile("default") on SwaggerConfig and have your tests run with the active profile "test". This will stop the Swagger configs from loading when running tests.

Let me know if you need further help.


String Interning — What, Why, and When?

May 31, 2014 · Saurabh Chhajed

Thanks Miten Mehta !!

String Interning — What, Why, and When?

May 28, 2014 · Saurabh Chhajed

Yes Oleksandr, I agree with you on this.


Apache CXF vs. Apache AXIS vs. Spring WS

May 26, 2014 · mitchp

Hello Ankur,

Nice article! This proved to be a good help for us while choosing between these technologies. A few points we observed:

  • Apache CXF was performing comparatively slower than Axis2 and Spring WS for a sample implementation of one of our web services.
  • Spring WS always worked like a charm; it was quick to set up and quite fast compared to the other two.
  • A good thing about Spring WS was that it helped us eliminate the need to generate WSDL-bound stubs for the service interface using the xjc utility, which only required generating stubs for the elements.

Thanks,
Saurabh


String Interning — What, Why, and When?

May 25, 2014 · Saurabh Chhajed

Hello Robert, it is fairly uncommon, and not recommended, to construct strings using the new operator, since every newly constructed string is always stored in the heap rather than the string pool, and you may end up with unnecessary duplicate strings in your heap area.

Having said that, string interning is not always recommended either, at least up to Java 6, since interned strings are stored in the PermGen area and will not be GCed, which may create memory-leak issues. It is also necessary to remember to intern all the strings you intend to compare in order to get any performance gain; otherwise, there could be consistency issues in your application.

Please let me know your thoughts.



