
A Comparison of Two Large CSV Files in MuleSoft


One file is 5 MB and the other is 1 MB. How can we deal with such large files without setting aside a huge amount of time?


In this scenario, we have two CSV files. The requirement is to replace the fourth field of File 1 (J_D) with the fourth field of File 2 (I_D) whenever the first three fields of a row are the same in both files.

File 1: JDEFile

J_A,J_B,J_C,J_D
82301,8179613,20161219,555
82301,8179613,20161226,1155
82301,8179613,20170102,954
82301,8179613,20170109,668
82301,8179613,20170116,968
82301,8179613,20170123,602
82301,8179613,20170130,782
82301,8179613,20170206,815
82301,8179613,20170213,632

File 2: IL File

I_A,I_B,I_C,I_D
82301,8179613,20170213,632
82301,8179613,20170220,632
82301,8179613,20170206,810
82301,8179613,20170123,6021

Approach 1

        <mulerequester:request resource="file://D:/LOC/JDE_UPLOAD_REPORT.csv?autoDelete=false" doc:name="JD_File"/>
        <set-variable variableName="jdeFile" value="#[payload]" mimeType="application/csv" doc:name="JDEFile"/>
        <mulerequester:request resource="file://D:/LOC/UK_EU_DSFCST_IL.csv?autoDelete=false" doc:name="IL_File"/>
        <set-variable variableName="ilFile" value="#[payload]" doc:name="ILFile" mimeType="application/csv"/>
        <dw:transform-message metadata:id="bb792c8d-6ebf-4343-a470-81746b5b1873" doc:name="FilterMap">
            <dw:input-variable mimeType="application/csv" variableName="jdeFile"/>
            <dw:input-variable mimeType="application/csv" variableName="ilFile"/>
            <dw:set-payload><![CDATA[
%dw 1.0
%output application/csv
%var D = null
%function checkPresence(A,B,C) flowVars.ilFile[?($[0] == A and $[1] ==B and $[2] == C)][0][3]  
---
flowVars.jdeFile map ((jdeFile , indexOfJdeFile) -> {
      R_A: jdeFile[0],
      R_B: jdeFile[1],
      R_C: jdeFile[2],
      R_D: checkPresence(jdeFile[0],jdeFile[1],jdeFile[2]) default jdeFile[3]
})
]]></dw:set-payload>
</dw:transform-message>

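Approach 1 is not efficient for larger files: for every row of the JDE file, checkPresence filters the whole IL file, so the work grows with the product of the two row counts. It takes more than 30 minutes to scan and compare. Let's move on to Approach 2.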
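To make the cost of Approach 1 concrete, here is a rough plain-Java sketch of the same lookup logic outside Mule. The class and method names (LinearScanSketch, findReplacement, merge) are hypothetical, and the rows are a subset of the sample data above. The sketch stops at the first match, whereas the DataWeave filter may inspect every IL row, so the count it prints is best read as a lower bound.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinearScanSketch {
    // Counts how many IL rows were inspected, to make the cost visible.
    static long rowsInspected = 0;

    // Roughly mirrors checkPresence: walk the IL rows until the first three fields match.
    static String findReplacement(List<String[]> ilRows, String a, String b, String c) {
        for (String[] il : ilRows) {
            rowsInspected++;
            if (il[0].equals(a) && il[1].equals(b) && il[2].equals(c)) {
                return il[3];
            }
        }
        return null; // no match: the caller keeps the original J_D
    }

    // One full scan of the IL rows per JDE row.
    static List<String> merge(List<String[]> jdeRows, List<String[]> ilRows) {
        List<String> out = new ArrayList<>();
        for (String[] jde : jdeRows) {
            String replacement = findReplacement(ilRows, jde[0], jde[1], jde[2]);
            String d = (replacement != null) ? replacement : jde[3];
            out.add(String.join(",", jde[0], jde[1], jde[2], d));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> jde = Arrays.asList(
            "82301,8179613,20170123,602".split(","),
            "82301,8179613,20170130,782".split(","),
            "82301,8179613,20170206,815".split(","));
        List<String[]> il = Arrays.asList(
            "82301,8179613,20170213,632".split(","),
            "82301,8179613,20170220,632".split(","),
            "82301,8179613,20170206,810".split(","),
            "82301,8179613,20170123,6021".split(","));
        merge(jde, il).forEach(System.out::println);
        // 3 JDE rows against 4 IL rows already cost 11 row inspections.
        System.out.println("IL rows inspected: " + rowsInspected);
    }
}
```

With files of many thousands of rows each, the number of inspections grows multiplicatively, which is consistent with the long run time reported above.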

Approach 2

File 1: JDEFile (size 5 MB).

File 2: IL File (size 1 MB).

XML Flow

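The flow below shows how Java code can be used from inside a DataWeave script: the getKey global function, called in the SwapCreate transform, looks up values in the HashMap built by the Java component.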

  • Read JD_File and store it as a string in a flow variable named jdeFile. Headers are added explicitly because otherwise the first line is omitted from the output of the DataWeave scripts.

  • Read IL_File and transform it into an ArrayList of HashMaps. 

  • Java code is used to add the elements of all the HashMaps in the ArrayList to one single HashMap. Refer to the file CreateHashMap.java. 
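Outside Mule, the idea behind these steps can be sketched in plain Java: index the IL rows once by their concatenated first three fields, then do a single hash lookup per JDE row. This is only an illustration under assumptions, not the flow itself; HashLookupSketch, buildIndex, and swap are hypothetical names, and the rows come from the sample data above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashLookupSketch {
    // Index the IL rows: concatenated first three fields -> fourth field (I_D).
    // If a key occurs more than once, the later row overwrites the earlier one.
    static Map<String, String> buildIndex(List<String> ilLines) {
        Map<String, String> index = new HashMap<>();
        for (String line : ilLines) {
            String[] f = line.split(",");
            index.put(f[0] + f[1] + f[2], f[3]);
        }
        return index;
    }

    // One hash lookup per JDE row; J_D is kept when the key is not in the index.
    static List<String> swap(List<String> jdeLines, Map<String, String> index) {
        List<String> out = new ArrayList<>();
        for (String line : jdeLines) {
            String[] f = line.split(",");
            String d = index.getOrDefault(f[0] + f[1] + f[2], f[3]);
            out.add(String.join(",", f[0], f[1], f[2], d));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> jde = Arrays.asList(
            "82301,8179613,20170123,602",
            "82301,8179613,20170130,782",
            "82301,8179613,20170206,815",
            "82301,8179613,20170213,632");
        List<String> il = Arrays.asList(
            "82301,8179613,20170213,632",
            "82301,8179613,20170220,632",
            "82301,8179613,20170206,810",
            "82301,8179613,20170123,6021");
        // Prints the four JDE rows with J_D values 6021, 782, 810, 632.
        swap(jde, buildIndex(il)).forEach(System.out::println);
    }
}
```

The IL file is read once to build the index, so the total work is roughly proportional to the sum of the two row counts rather than their product.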

<?xml version="1.0" encoding="UTF-8"?>
<mule xmlns:cluster="
      . . . . . .
      . . . . . .
 http://www.mulesoft.org/schema/mule/http http://www.mulesoft.org/schema/mule/http/current/mule-http.xsd">
 <file:connector name="File" autoDelete="false"  streaming="false" validateConnections="true" doc:name="File"/>
 <configuration doc:name="Configuration">
  <expression-language>
               <global-functions>
                def getKey(value) {
                return payload.get(value)
                }
               </global-functions>
  </expression-language>
 </configuration>
    <flow name="filesrbFlow">
        <poll doc:name="Poll">
            <fixed-frequency-scheduler frequency="1000" timeUnit="DAYS"/>
            <logger message="****************************Started*****************************" level="INFO" doc:name="Logger"/>
        </poll>
        <mulerequester:request resource="file://D:/Accounts/POC/JDE_UPLOAD_REPORT.csv?autoDelete=false" doc:name="JD_File"/>
        <set-variable variableName="jdeFile" value="J_A,J_B,J_C,J_D#['\n']#[message.payloadAs(java.lang.String)]" mimeType="application/csv" doc:name="JDEFile"/>
        <mulerequester:request resource="file://D:/Accounts/POC/UK_EU_DSFCST_IL.csv?autoDelete=false" doc:name="IL_File"/>
        <set-payload value="I_A,I_B,I_C,I_D#['\n']#[message.payloadAs(java.lang.String)]" mimeType="application/csv" doc:name="Set Payload"/>
        <dw:transform-message metadata:id="af17dc95-43d6-4749-b97e-f7d4c8ef58b0" doc:name="IL_File_Map">
            <dw:input-payload mimeType="application/csv"/>
            <dw:input-variable mimeType="application/csv" variableName="jdeFile"/>
            <dw:set-payload><![CDATA[%dw 1.0
%output application/java
---
payload map ((payload01 , indexOfPayload01) -> {
      ABC: payload01[0] ++ payload01[1] ++ payload01[2],
      D: payload01[3]
})]]></dw:set-payload>

        </dw:transform-message>
        <logger message="*********************Starting FlowStep - Transforming File Now****************************" level="INFO" doc:name="Logger"/>
        <component class="filesrb.utils.CreateHashMap" doc:name="Java"/>
        <dw:transform-message metadata:id="bb792c8d-6ebf-4343-a470-81746b5b1873" doc:name="SwapCreate">
            <dw:input-variable mimeType="application/csv" variableName="jdeFile"/>
            <dw:set-payload><![CDATA[%dw 1.0
%output application/csv header=false, separator = ","
---
flowVars.jdeFile map ((jdeFile , indexOfJdeFile) -> {
      A: jdeFile[0],
      B: jdeFile[1],
      C: jdeFile[2],
      D: getKey(jdeFile[0] ++ jdeFile[1] ++ jdeFile[2]) default jdeFile[3]
     }
)]]></dw:set-payload>
        </dw:transform-message>
        <logger message="*********************Writing File Now****************************" level="INFO" doc:name="Logger"/>
        <file:outbound-endpoint path="D:\Accounts\RB\POC" outputPattern="RESPONSE.csv" connector-ref="File" responseTimeout="10000" doc:name="File"/>
        <logger message="*********************End Transforming &amp; Writing ****************************" level="INFO" doc:name="Logger"/>
    </flow>
</mule>

Java Code

package filesrb.utils;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.mule.api.MuleEventContext;
import org.mule.api.lifecycle.Callable;

public class CreateHashMap implements Callable {

      @SuppressWarnings("unchecked")
      @Override
      public Object onCall(MuleEventContext eventContext) throws Exception {
            // The payload is the list of {ABC, D} maps produced by the IL_File_Map transform.
            List<Map<String, String>> list = (List<Map<String, String>>) eventContext.getMessage().getPayload();
            // Flatten the list into a single lookup map: concatenated key (ABC) -> replacement value (D).
            Map<String, String> ilfilemap = new HashMap<String, String>();
            for (Map<String, String> entry : list) {
                  // Read the fields by name instead of relying on the map's iteration order.
                  ilfilemap.put(entry.get("ABC"), entry.get("D"));
            }
            return ilfilemap;
      }
}

The above task can be done without using Java code. 

In this flow, we add values to a flow variable that holds a HashMap, using an MEL global function (putKeyValue) called from the DataWeave script. The same variable is then used to look up a value by its key (getKey).

XML Flow

<?xml version="1.0" encoding="UTF-8"?>
<mule xmlns:cluster=" 
    . . . . . .
    . . . . . . ">
    <file:connector name="File" autoDelete="false"  streaming="false" validateConnections="true" doc:name="File"/>
    <configuration doc:name="Configuration">
<expression-language>
               <global-functions>
                def getKey(value) {
                return flowVars.ilFile.get(value)
                }
                def putKeyValue(key,value) {
                  flowVars.ilFile.put(key,value)           
                }
               </global-functions>
            </expression-language>
 </configuration>
    <flow name="filesrbFlow">
        <poll doc:name="Poll">
            <fixed-frequency-scheduler frequency="1000" timeUnit="DAYS"/>
            <logger message="****************************Started*****************************" level="INFO" doc:name="Logger"/>
        </poll>
        <mulerequester:request resource="file://D:/Accounts/POC/JDE_UPLOAD_REPORT.csv?autoDelete=false" doc:name="JD_File_Read"/>
        <set-variable variableName="jdeFile" value="J_A,J_B,J_C,J_D#['\n']#[message.payloadAs(java.lang.String)]" mimeType="application/csv" doc:name="JDE_File_Var"/>
        <mulerequester:request resource="file://D:/Accounts/POC/UK_EU_DSFCST_IL.csv?autoDelete=false" doc:name="IL_File_Read"/>
        <set-variable variableName="ilFile" value="#[new java.util.HashMap()]" doc:name="IL_File_Var"/>
        <set-payload value="I_A,I_B,I_C,I_D#['\n']#[message.payloadAs(java.lang.String)]" mimeType="application/csv" doc:name="Set Payload"/>
        <dw:transform-message metadata:id="af17dc95-43d6-4749-b97e-f7d4c8ef58b0" doc:name="IL_File_Map">
            <dw:input-payload mimeType="application/csv"/>
            <dw:input-variable mimeType="application/csv" variableName="jdeFile"/>
            <dw:set-payload><![CDATA[%dw 1.0
%output application/java
---
payload map ((payload01 , indexOfPayload01) -> {
      ABC: putKeyValue(payload01[0] ++ payload01[1] ++ payload01[2], payload01[3])
})]]></dw:set-payload>

        </dw:transform-message>
        <logger message="*********************Starting FlowStep - Transforming File Now****************************" level="INFO" doc:name="Logger"/>

        <dw:transform-message metadata:id="bb792c8d-6ebf-4343-a470-81746b5b1873" doc:name="SwapCreate">
            <dw:input-variable mimeType="application/csv" variableName="jdeFile"/>
            <dw:set-payload><![CDATA[%dw 1.0
%output application/csv header=false, separator = ","
---
flowVars.jdeFile map ((jdeFile , indexOfJdeFile) -> {
      A: jdeFile[0],
      B: jdeFile[1],
      C: jdeFile[2],
      D: getKey(jdeFile[0] ++ jdeFile[1] ++ jdeFile[2]) default jdeFile[3]

      }
)]]></dw:set-payload>
        </dw:transform-message>

        <logger message="*********************Writing File Now****************************" level="INFO" doc:name="Logger"/>
        <file:outbound-endpoint path="D:\Accounts\RB\POC" outputPattern="RESPONSE.csv" connector-ref="File" responseTimeout="10000" doc:name="File"/>
        <logger message="*********************End Transforming &amp; Writing ****************************" level="INFO" doc:name="Logger"/>
    </flow>
</mule>
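One detail of the lookup key used in both flows may be worth checking against real data: the three fields are concatenated without a separator, so two different triples can produce the same key. The sketch below illustrates this with made-up values; CompositeKeySketch, concatKey, and delimitedKey are hypothetical names, and the "|" separator is an assumption that only helps if that character never appears in the fields.

```java
class CompositeKeySketch {
    // Same idea as the flows: plain concatenation, no separator.
    static String concatKey(String a, String b, String c) {
        return a + b + c;
    }

    // Hypothetical alternative: join with a separator assumed absent from the data.
    static String delimitedKey(String a, String b, String c) {
        return String.join("|", a, b, c);
    }

    public static void main(String[] args) {
        // Two different triples whose first two fields split at a different position.
        String k1 = concatKey("82301", "8179613", "20170213");
        String k2 = concatKey("823018", "179613", "20170213");
        System.out.println(k1.equals(k2)); // true: the plain keys collide

        String d1 = delimitedKey("82301", "8179613", "20170213");
        String d2 = delimitedKey("823018", "179613", "20170213");
        System.out.println(d1.equals(d2)); // false: the separator keeps them apart
    }
}
```

If every field has a fixed width, as the sample rows suggest, plain concatenation cannot collide and the separator is unnecessary.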


Topics:
dataweave, csv, mulesoft, integration
