Comparing and Matching data in 2 Excel files

**Kingston86** · 04-25-2023, 08:44 AM

Hey all I wonder if you can help at all. I am looking for a solution to a problem I am having with 2 XML files that have been output by an external program and I want to compare the data in the two files across a select number of columns in which I know the data can be matched as these columns are common between both files. I have done a lot of google searching and digging to try and find a solution but haven't managed to as yet. The only thing I seemed to find that helped was that the native XML files had poor structure and so opening them and exporting as XLSX files gave a far better file structure which can then be used for my purpose.

What I am trying to create is something that will convert an XML file into an XLSX file and then use the XLSX file to do some data matching between a pre-set number of columns but data must match across all the designated columns in order to be a match, it can?t just be 1 or 2 that match and that?s good enough it needs to be all. Once it has found a match it outputs the full row of each file into a new file which has 2 sheets in, each sheet represents one of the original files, within each sheet is a newly created column called Key which contains a unique reference number starting at 0000001 and going up which allows for a reference point to the matching rows as I will be doing some filtering of the data at a later stage but for now just being able to match up the data would be amazing.

Here is a list of the columns in each file that should be used purely for the purpose of matching data.

Withing the forward file: 'Back Result', 'Trades', 'PipStep', 'PipStepExponent', 'DelayTradeSequence', 'ReverseSequenceDirection', 'LotSize', 'LotSizeExponent', 'RsiTimeframe', 'RsiPeriod', 'RsiSellLevel', 'EmaTimeframe', 'EmaPeriods', 'EmaTrendRule', 'AdxTimeframe', 'AdxPeriod','AdxThreshold', 'AdxTrendRule', 'NewsAction'

Within the back file: 'Result', 'Trades', 'PipStep', 'PipStepExponent', 'DelayTradeSequence', 'ReverseSequenceDirection', 'LotSize', 'LotSizeExponent', 'RsiTimeframe', 'RsiPeriod', 'RsiSellLevel', 'EmaTimeframe', 'EmaPeriods', 'EmaTrendRule', 'AdxTimeframe', 'AdxPeriod','AdxThreshold', 'AdxTrendRule', 'NewsAction'

Once a match has been found using all of the above columns to match the data then that entire row should be output to the new _Matched file.

The full column list from each file is as follows.

Forward file: ?Pass?, ?Forward Result?, ?Back Result?, ?Profit?, ?Expected Payoff?, ?Profit Factor?, ?Recovery Factor?, ?Sharpe Ratio?, ?Custom?, ?Equity DD %?, ?Trades?, 'PipStep', 'PipStepExponent', 'DelayTradeSequence', 'ReverseSequenceDirection', 'LotSize', 'LotSizeExponent', 'RsiTimeframe', 'RsiPeriod', 'RsiSellLevel', 'EmaTimeframe', 'EmaPeriods', 'EmaTrendRule', 'AdxTimeframe', 'AdxPeriod','AdxThreshold', 'AdxTrendRule', 'NewsAction'

Back File: ?Pass?, ?Result?, ?Profit?, ?Expected Payoff?, ?Profit Factor?, ?Recovery Factor?, ?Sharpe Ratio?, ?Custom?, ?Equity DD %?, ?Trades?, 'PipStep', 'PipStepExponent', 'DelayTradeSequence', 'ReverseSequenceDirection', 'LotSize', 'LotSizeExponent', 'RsiTimeframe', 'RsiPeriod', 'RsiSellLevel', 'EmaTimeframe', 'EmaPeriods', 'EmaTrendRule', 'AdxTimeframe', 'AdxPeriod','AdxThreshold', 'AdxTrendRule', 'NewsAction'

Would anyone know of a method I could use to achieve this? Thanks in advance for any help or suggestions offered.

**AliGW** · 04-25-2023, 08:45 AM

Welcome to the forum.

There are instructions at the top of the page explaining how to attach your sample workbook. Screenshots are of little practical use as we cannot manipulate them.

A good sample workbook has just 10-20 rows of representative data that has been desensitised. It also has expected results mocked up, worked examples where required, relevant cells highlighted and a few explanatory notes.

**Kingston86** · 04-25-2023, 09:01 AM

Hey AliGW,

Thank you for the warm welcome, I have attached some files to this post (hopefully) and below is an explanation of the files and their uses.

EURNZD_Forward_SAMPLE.xlsx & EURNZD_Back_SAMPLE.xlsx - These are the RAW files that are output from a third party program (MetaTrader 5) which is essentially all the data of a series of tests that have been run within that program exported to excel for the purpose of filtering etc. The columns that are highlighted in Yellow are the columns I am looking to match the data between as these Columns are common across both files, with the exception of one column being named differently between the two files which is the Back file results column. In the forward file it is referred to as 'Back Result' and within the back file it is simply labelled 'Result'

EURNZD_Matched_SAMPLE.xlsx - This would be the output file of the data once it has been matched and assigned a reference key. Once all data has been matched I will then run a macro within excel to bring some columns from the back sheet into the forward sheet and then filter the results based on a set of parameters, unfortunately in order to do this I first must match up the data and assign a key so that I can bring in columns using a VLOOKUP or similar method.

Hopefully this is of some help or gives a better understanding to my problem.

EURNZD_Back_SAMPLE.xlsx EURNZD_Forward_SAMPLE.xlsx EURNZD_Matched_SAMPLE.xlsx

**AliGW** · 04-25-2023, 09:05 AM

I think you might want to look at Power Query (Get & Transform on the Data ribbon).

Have a look at this tutorial: https://trumpexcel.com/combine-data-...ple-workbooks/

**AliGW** · 04-25-2023, 09:10 AM

Actually, on re-reading your opening post, I think you need VBA for this. Shall I move the thread for you?

**Kingston86** · 04-25-2023, 09:22 AM

If you could that would be greatly appreciated. Thank you

**AliGW** · 04-25-2023, 09:23 AM

Done. Hope you get some help!

**Kingston86** · 04-27-2023, 12:35 PM

Just as an update and in case anyone ever has a similar issue I guess. I managed to get the problem solved using a python script. Probably could have been done in VBA also but I had a friend who was able to solve it using python so I will mark this case as SOLVED and I have included the code below in case it helps anyone else out.

import os
import pandas as pd
import xml.etree.ElementTree as ET
import sys
import time
from colorama import init, Fore, Style
import hashlib


def typewriter(text, delay=0.1):
    for char in text:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(delay)
    print()

def full_clean(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            if '_Forward' in file or '_Back' in file:
                os.remove(os.path.join(root, file))

def partial_clean(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.xml'):
                os.remove(os.path.join(root, file))

def convert_xml_to_xlsx(xml_path, xlsx_path):
    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()

        data = []
        for row in root.iter('{urn:schemas-microsoft-com:office:spreadsheet}Row'):
            row_data = [cell.find('{urn:schemas-microsoft-com:office:spreadsheet}Data').text for cell in row.findall('{urn:schemas-microsoft-com:office:spreadsheet}Cell')]
            data.append(row_data)

        df = pd.DataFrame(data[1:], columns=data[0])

        with pd.ExcelWriter(xlsx_path, engine='openpyxl') as writer:
            df.to_excel(writer, index=False)

    except FileNotFoundError:
        print(f"Error: {xml_path} not found.")
        return False
    except ET.ParseError as e:
        print(f"Error parsing XML file {xml_path}: {str(e)}")
        return False
    except Exception as e:
        print(f"Error converting {xml_path} to {xlsx_path}: {str(e)}")
        return False

    return True


def hash_columns(row, columns):
    concatenated = ''.join(str(row[col]) for col in columns)
    return hashlib.sha1(concatenated.encode('utf-8')).hexdigest()

def match_data(currency_pair, directory):
    print(Fore.YELLOW + "Converting XML to XLSX..." + Style.RESET_ALL)
    forward_xml_path = os.path.join(directory, f"{currency_pair}_Forward.xml")
    back_xml_path = os.path.join(directory, f"{currency_pair}_Back.xml")
    forward_xlsx_path = os.path.join(directory, f"{currency_pair}_Forward.xlsx")
    back_xlsx_path = os.path.join(directory, f"{currency_pair}_Back.xlsx")

    if not convert_xml_to_xlsx(forward_xml_path, forward_xlsx_path) or not convert_xml_to_xlsx(back_xml_path, back_xlsx_path):
        return

    print(Fore.YELLOW + "Reading individual files..." + Style.RESET_ALL)
    forward_df = pd.read_excel(forward_xlsx_path)
    back_df = pd.read_excel(back_xlsx_path)

    forward_keys = ['Back Result', 'PipStep', 'PipStepExponent', 'DelayTradeSequence', 'ReverseSequenceDirection', 'LotSize', 'LotSizeExponent', 'RsiTimeframe', 'RsiPeriod', 'RsiSellLevel', 'EmaTimeframe', 'EmaPeriods', 'AdxTimeframe', 'AdxPeriod', 'AdxThreshold', 'NewsAction']
    back_keys = ['Result', 'PipStep', 'PipStepExponent', 'DelayTradeSequence', 'ReverseSequenceDirection', 'LotSize', 'LotSizeExponent', 'RsiTimeframe', 'RsiPeriod', 'RsiSellLevel', 'EmaTimeframe', 'EmaPeriods', 'AdxTimeframe', 'AdxPeriod', 'AdxThreshold', 'NewsAction']

    print(Fore.YELLOW + "Matching data using specified columns..." + Style.RESET_ALL)
    forward_df['hash'] = forward_df.apply(lambda row: hash_columns(row, forward_keys), axis=1)
    back_df['hash'] = back_df.apply(lambda row: hash_columns(row, back_keys), axis=1)

    matched_df = forward_df.merge(back_df, on='hash', suffixes=('_forward', '_back'), how='inner').drop(columns=['hash'])

    print(Fore.YELLOW + "Generating key for matched data..." + Style.RESET_ALL)
    matched_df['key'] = matched_df.index + 1
    matched_df['key'] = matched_df['key'].apply(lambda x: f"{x:07d}")

    print(Fore.YELLOW + "Saving matched data to Excel file..." + Style.RESET_ALL)
    save_to_excel(matched_df, currency_pair, directory)

def save_to_excel(df, currency_pair, directory):
    output_path = os.path.join(directory, f"{currency_pair}_Matched.xlsx")

    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        df[['key'] + [col for col in df.columns if col.endswith('_forward') or col == 'Forward Result' or col == 'Back Result']].to_excel(writer, index=False, sheet_name='Forward')
        df[['key'] + [col for col in df.columns if col.endswith('_back') or col == 'Result']].to_excel(writer, index=False, sheet_name='Back')

    print(Fore.GREEN + f"Matched data saved to {output_path}" + Style.RESET_ALL)


def main():
    while True:
        init()
        currency_pair = input(Fore.CYAN + "Please enter the currency pair (e.g. EURNZD): " + Style.RESET_ALL).upper()
        user_home = os.path.expanduser("~")
        downloads_dir = os.path.join(user_home, "Downloads")
        directory = os.path.join(downloads_dir, f"Automated MT5 Results\\{currency_pair}")

        if not os.path.exists(directory):
            print(f"Error: Directory '{directory}' does not exist.")
        else:
            match_data(currency_pair, directory)

            user_choice = input(
                Fore.CYAN + "Do you want to match another set of files? (Y/N): " + Style.RESET_ALL).upper()
            if user_choice != 'Y':
                cleanup_choice = input(
                    Fore.CYAN + "Do you want to clean up the folders? (Y/N): " + Style.RESET_ALL).upper()
                if cleanup_choice == 'Y':
                    clean_type = input(
                        Fore.CYAN + "Select the type of clean: Full (F) or Partial (P): " + Style.RESET_ALL).upper()
                    if clean_type == 'F':
                        full_clean(directory)
                        print(Fore.GREEN + "Full clean performed successfully." + Style.RESET_ALL)
                    elif clean_type == 'P':
                        partial_clean(directory)
                        print(Fore.GREEN + "Partial clean performed successfully." + Style.RESET_ALL)
                    else:
                        print(Fore.RED + "Invalid clean type selected. No cleanup performed." + Style.RESET_ALL)

                typewriter(
                    Fore.GREEN + Style.BRIGHT + "All matching has been completed successfully." + Style.RESET_ALL)
                break

if __name__ == "__main__":
    main()

Comparing and Matching data in 2 Excel files

LinkBack

Thread Tools

Rate This Thread

Display

Hybrid View

Comparing and Matching data in 2 Excel files

Re: Comparing and Matching data in 2 Excel files

Re: Comparing and Matching data in 2 Excel files

Re: Comparing and Matching data in 2 Excel files

Re: Comparing and Matching data in 2 Excel files

Re: Comparing and Matching data in 2 Excel files

Re: Comparing and Matching data in 2 Excel files

Re: Comparing and Matching data in 2 Excel files

Thread Information

Users Browsing this Thread

Similar Threads

Comparing two files for matching or similar names to retrieve ID numbers.

Finding Matching Data From Multiple Columns in 2 Different Excel Files

Comparing 2 columns that have similar matching data, need to extract the matching data

Look up data excel files of a folder, output all matching data in master file

comparing 2 excel files and culling the different data?

Using condition to extract data by matching values from 2 different Excel files

Matching data from seperate excel files

Tags for this Thread

Bookmarks

Bookmarks

Posting Permissions