
fpsync parts number and intermediate runs

Fab11
2023-07-19
2023-10-26
  • Fab11

    Fab11 - 2023-07-19

    Hello,

    I have to migrate a large amount of data (around 100 TB) to a new storage array. The data is mostly composed of small files (100 KB max) located in a directory structure seven levels deep. The transfer to the new array has to go through an SMB/CIFS share. So, Samba and small files: the worst-case scenario...
    Using rsync or robocopy would take ages. I discovered fpart and fpsync by accident, and according to my speed tests they turned out to be the magic solution: they would let me copy this data in two weeks instead of six months.

    I read the article "Parallélisez vos transferts de fichiers" ("Parallelize your file transfers"; I'm French) with great interest, but I need some advice on two subjects:

    1) To avoid waiting for big lists to be generated by fpart before starting rsync, I set fpsync to use parts of 1,000 files (-f). But I realized the number of parts is huge (45,000+). I should mention that the data is spread across several 8 TB volumes, processed one at a time. The server running the jobs has 48 cores and fpsync is set up to run 24 parallel rsync processes (-n).
    Is this a mistake? Should I use bigger parts, for instance 10,000 files each, to reduce the total number of parts?

    2) I didn't find, either in the article or on the web, information about intermediate synchronizations (the "middle" ones, before the final one). How should these be handled? Just by running the same fpsync command again? Re-running an existing job would skip part generation, but it would not pick up new files in the source, so restarting from scratch seems mandatory. But that also means it will take almost as long as the first pass? In my case the files are so small I doubt rsync'ing them again will make much difference.
    Another solution could be to retrieve all the parts of a job and merge them into one or a few files to pass to rsync, so as to run a few syncs instead of so many. But again, new files would not be taken into account.
    In my opinion, the documentation about restarting/replaying jobs lacks details and examples.

    How would you approach this?

    Thanks for your help and advice.

     
    • Ganael Laplanche

      On Wednesday, July 19, 2023 8:52:32 PM CEST Fab11 wrote:

      Hello,

      Sorry for my delayed answer, I am just back from holidays :p

      [...]
      1) To avoid waiting for big lists to be generated by fpart before starting
      rsync, I set fpsync to use parts of 1,000 files (-f). But I realized the
      number of parts is huge (45,000+). I should mention that the data is spread
      across several 8 TB volumes, processed one at a time. The server running
      the jobs has 48 cores and fpsync is set up to run 24 parallel rsync
      processes (-n).
      Is this a mistake? Should I use bigger parts, for instance 10,000 files
      each, to reduce the total number of parts?

      Yes, I would probably try to generate fewer (bigger) partitions, but there
      is no easy way to compute the ideal number:

      1) If you generate too many (small) partitions, you will lose time forking
      very small rsync processes.

      2) Fpsync is able to start transfers as soon as a single partition has been
      generated; it generates the next ones during that transfer. If you start 24
      parallel rsync jobs, you probably want those first 24 partitions to be
      generated as fast as possible, so you don't want them to be too big.

      You have to find a good balance here. In any case, 1,000 files definitely
      seems too small (IMHO).
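      As a rough illustration of that balance (all numbers below are
      illustrative assumptions loosely based on this thread's figures, not a
      recommendation), the part count and the number of worker "rounds" can be
      estimated like this:

      ```shell
      # Back-of-the-envelope sizing: how many parts and scheduling rounds a run
      # would produce. All numbers are illustrative assumptions.
      files_total=45000000   # ~45,000 parts x 1,000 files from the first attempt
      workers=24             # parallel rsync processes (fpsync -n 24)
      part_size=100000       # files per part (fpsync -f)

      parts=$(( (files_total + part_size - 1) / part_size ))   # ceiling division
      rounds=$(( (parts + workers - 1) / workers ))
      echo "parts=$parts rounds=$rounds"   # parts=450 rounds=19
      ```

      Bigger parts shrink the total part count (and the fork overhead), at the
      cost of a longer wait before the first 24 partitions are ready.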

      2) I didn't find, either in the article or on the web, information about
      intermediate synchronizations (the "middle" ones, before the final one).
      How should these be handled? Just by running the same fpsync command
      again? Re-running an existing job would skip part generation, but it
      would not pick up new files in the source, so restarting from scratch
      seems mandatory. But that also means it will take almost as long as the
      first pass?

      Yes, re-running fpsync the same way will take the same time as the first pass.

      To avoid crawling time, you can use fpsync's replay feature (-R) but, as you
      say, it will only update known files' contents. If your files only change in
      content, this can be a good solution. If they are frequently deleted and
      replaced by other ones (i.e. names change), it would not work, as fpsync
      would skip most of them. There is no easy solution here; the only way to get
      new file names is to crawl the filesystem again (re-run fpsync from scratch).

      For the final pass, you can have a look at fpsync's -E option, which makes
      it work on a directory basis and enables rsync's --delete option. But if
      you have very few directories it will not work very well (it will not be
      able to produce enough partitions).

      Another solution could be to retrieve all the parts
      of a job and merge them into one or a few files to pass to rsync, so as
      to run a few syncs instead of so many. But again, new files would not be
      taken into account.

      That would just replay the synchronization. If that's what you want, you
      probably want to use the -E option; it will be easier to handle.
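      For reference, the manual merge described above could be sketched like
      this (the directory layout and file names below are made up for the demo;
      fpart part files are plain lists of paths, one per line):

      ```shell
      # Demo layout: two fake part files, one path per line (made-up names).
      demo=/tmp/fpsync_merge_demo
      mkdir -p "$demo"
      printf 'dir1/file1\ndir1/file2\n' > "$demo/part.0"
      printf 'dir2/file3\n'             > "$demo/part.1"

      # Merge all parts into a single list...
      cat "$demo"/part.* > "$demo/all_files.txt"

      # ...which could then feed a single rsync instead of one rsync per part:
      #   rsync -a --files-from="$demo/all_files.txt" /src/ /dst/
      wc -l < "$demo/all_files.txt"   # 3
      ```

      Note that this inherits the limitation discussed above: files created
      after the parts were generated will not appear in the merged list.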

      In my opinion, the documentation about restarting/replaying jobs lacks
      details and examples.
      How would you approach this?

      Thanks for that feedback, I'll add that to my TODO list.

      Here is a small example:

      $ fpsync -l
      <=== Listing runs

      Nothing has been run here.

      $ fpsync -n 2 /usr/src/ /var/tmp/src/

      That command starts a first run...

      $ fpsync -l
      <=== Listing runs
      ===> Run: ID: 1690106684-1860, status: replayable (synchronization complete,
      use -R to replay)

      ...which becomes replayable afterwards, with this command:

      $ fpsync -R -r 1690106684-1860

      You can prepare a run (i.e. generate jobs without running the rsync
      commands) by adding -p to the initial command:

      $ fpsync -p -n 2 /usr/src/ /var/tmp/src/
      1690106994 <=== Successfully prepared run: 1690106992-4040
      $ fpsync -l
      <=== Listing runs
      ===> Run: ID: 1690106684-1860, status: replayable (synchronization complete,
      use -R to replay)
      ===> Run: ID: 1690106992-4040, status: resumable (synchronization not
      complete, use -r to resume)

      It then becomes resumable and can be started this way:

      $ fpsync -r 1690106992-4040

      A side note: you probably want to use the current git version of fpsync,
      as a bug regarding resume/replay has been fixed:

      https://github.com/martymac/fpart/commit/be14d1c172daca70a2502a231e75d72f9e398265

      Hope this helps,
      Best regards,
      (and thanks for your interest in fpart/fpsync !)

      --
      Ganael LAPLANCHE ganael.laplanche@martymac.org
      http://www.martymac.org | http://contribs.martymac.org
      FreeBSD: martymac martymac@FreeBSD.org, http://www.FreeBSD.org

       
  • Fab11

    Fab11 - 2023-07-25

    Hello,

    Many thanks for your detailed answer!

    You confirmed that 1,000 files is not enough; that's my opinion too. I realized it while frequently looking at top's output: the part files being passed to rsync (visible on the command lines) were way behind the last part generated, meaning fpart was generating parts faster than rsync could process them. I will maybe try to generate bigger parts for intermediate runs.

    Another question regarding jobs (thanks for the details on their usage). I was wondering which would be the most efficient, or the fastest: a replay (which skips filesystem crawling since all the part files have already been generated), or a single rsync? The difference would be passing a single file list to rsync (created by cat'ing all the part files together), whereas I guess fpsync reads all its parts and runs as many rsyncs as there are parts.
    Would that make sense?

    I would also like to ask you about log files (not only the logs, but everything related to a job in /tmp/fpsync by default), which turned out to be really big in my case (more than 20 GB for an 8 TB job). Aside from disk space, there is also a quirk: many files are generated. For each part, I saw at least three files: the part itself, a stdout log file and a stderr log file. As the last two are in the same directory, I ended up with slow directory listings. And since most, if not all, stderr files are empty when everything works well, they are somewhat useless.
    Would it make sense to add an option to "clean" the log directory of empty files at the end of a job, or even at the end of each rsync before switching to the next one?

    Thanks again.

    Best regards.

     
    • Ganael Laplanche

      On Tuesday, July 25, 2023 3:52:17 PM CEST Fab11 wrote:

      Hello,

      Many thanks for your detailed answer!

      You're welcome :)

      You confirmed that 1,000 files is not enough; that's my opinion too. I
      realized it while frequently looking at top's output: the part files
      being passed to rsync were way behind the last part generated, meaning
      fpart was generating parts faster than rsync could process them. I will
      maybe try to generate bigger parts for intermediate runs.

      Yes, that's the way to go, for sure.

      Another question regarding jobs (thanks for the details on their usage).
      I was wondering which would be the most efficient, or the fastest: a
      replay (which skips filesystem crawling since all the part files have
      already been generated), or a single rsync? The difference would be
      passing a single file list to rsync (created by cat'ing all the part
      files together), whereas I guess fpsync reads all its parts and runs as
      many rsyncs as there are parts. Would that make sense?

      I was suggesting the replay feature to avoid the burden of concatenating
      all the partitions, but if you don't mind handling that by hand, it could
      be worth testing. A single rsync might be fast too, as rsync would
      probably spend most of its time comparing file metadata; to be honest,
      this has to be tested to give a proper answer.

      I would also like to ask you about log files (not only the logs, but
      everything related to a job in /tmp/fpsync by default), which turned out
      to be really big in my case (more than 20 GB for an 8 TB job). Aside from
      disk space, there is also a quirk: many files are generated. For each
      part, I saw at least three files: the part itself, a stdout log file and
      a stderr log file. As the last two are in the same directory, I ended up
      with slow directory listings. And since most, if not all, stderr files
      are empty when everything works well, they are somewhat useless. Would it
      make sense to add an option to "clean" the log directory of empty files
      at the end of a job, or even at the end of each rsync before switching to
      the next one?

      Yes, that is an interesting idea; it could even be done by default, as
      empty files are probably useless. I can't look at it right now but I'll
      add it to my TODO list, thanks for the idea!
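      Pending such an option, the cleanup can already be done by hand with
      find(1); a minimal sketch on a mock log directory (file names are made
      up for the demo):

      ```shell
      # Mock log directory with one empty and one non-empty stderr file.
      logs=/tmp/fpsync_logs_demo
      mkdir -p "$logs"
      touch "$logs/part.0.stderr"                    # empty: safe to remove
      printf 'some error\n' > "$logs/part.1.stderr"  # non-empty: keep it

      # Delete only the empty *.stderr files, keeping real error logs.
      find "$logs" -type f -name '*.stderr' -empty -delete

      ls "$logs"   # only part.1.stderr remains
      ```

      Running this at the end of a job keeps the error logs that matter while
      avoiding the huge directory listings mentioned above.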

      Best regards,

      --
      Ganael LAPLANCHE ganael.laplanche@martymac.org
      http://www.martymac.org | http://contribs.martymac.org
      FreeBSD: martymac martymac@FreeBSD.org, http://www.FreeBSD.org

       
  • Fab11

    Fab11 - 2023-07-28

    Hello,

    I will try that and let you know.
    You are welcome for the ideas :) Thanks for your consideration.

    Best regards.

     
  • Fab11

    Fab11 - 2023-10-24

    Hello,

    Sorry for replying so late.

    I increased the part size from 1,000 to 10,000 files, but parts were still generated way faster than they were processed. I then moved to 100,000, which seems to fit better. In the worst case, the last part was generated just an hour before the end of the sync (half an hour before, on average), for a job duration of 7 to 9 hours.

    It is difficult to say whether it was quicker, and by how much, because it was a second pass anyway: a big part of the time rsync spent copying files during the first pass was not spent again during the second.

    To give an overview, the jobs were pretty well balanced. Each job generated around 1,900 parts of 100,000 files, lasted between 7 and 9 hours as said before, and generated around 10 GB of part files. I was still using 24 parallel rsync processes, on the same volumes (8 TB each).

    I couldn't compare an fpsync replay job against a single rsync using a single file list. In fact, I was able to generate lists of only the files modified in the source (from the application using the data), so I ended up with an optimized list.

    So, finding the right settings is not easy... But the tool is very powerful and hugely speeds up such a copy process compared to the usual tools (cp, mv, rsync, robocopy, ...), so the big added value is there anyway.

    I have other suggestions regarding log files and jobs:
    - It would be nice to have an option to replace the job ID (run ID), which looks like a timestamp, with its human-readable form YYYYDDMM-HHMMSS (or similar), or even with a custom name
    - fpsync -l could also show a status to quickly identify whether there were errors during the run. Maybe this could be retrieved from the last lines of log/<run_id>/fpsync.log, which outputs such a status?
    - It would be great to be able to replay only the errors of a job, so the process would be even faster. As errors are all logged in log/<run_id>/*.stderr, I was wondering if new lists could be generated from there

    What do you think about all this?

    Best regards.

     
    • Ganael Laplanche

      On 10/24/23 18:57, Fab11 wrote:

      Hello there,

      I increased the part size from 1,000 to 10,000 files, but parts were
      still generated way faster than they were processed. I then moved to
      100,000, which seems to fit better. In the worst case, the last part was
      generated just an hour before the end of the sync (half an hour before,
      on average), for a job duration of 7 to 9 hours.

      It is difficult to say whether it was quicker, and by how much, because
      it was a second pass anyway: a big part of the time rsync spent copying
      files during the first pass was not spent again during the second.

      To give an overview, the jobs were pretty well balanced. Each job
      generated around 1,900 parts of 100,000 files, lasted between 7 and 9
      hours as said before, and generated around 10 GB of part files. I was
      still using 24 parallel rsync processes, on the same volumes (8 TB each).

      I couldn't compare an fpsync replay job against a single rsync using a
      single file list. In fact, I was able to generate lists of only the
      files modified in the source (from the application using the data), so I
      ended up with an optimized list.

      So, finding the right settings is not easy... But the tool is very
      powerful and hugely speeds up such a copy process compared to the usual
      tools (cp, mv, rsync, robocopy, ...), so the big added value is there
      anyway.

      Yes, finding the right settings is not easy... Testing and adapting
      values like you did is probably the (only) way to go.

      Anyway, good to read that, and thanks for your detailed feedback :)

      I have other suggestions regarding log files and jobs:
      - It would be nice to have an option to replace the job ID (run ID),
      which looks like a timestamp, with its human-readable form
      YYYYDDMM-HHMMSS (or similar), or even with a custom name
      - fpsync -l could also show a status to quickly identify whether there
      were errors during the run. Maybe this could be retrieved from the last
      lines of log/<run_id>/fpsync.log, which outputs such a status?

      Those are good ideas. I'll add that to my TODO file, thanks.
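      Pending such an option, the timestamp part of a run ID can already be
      decoded by hand with date(1) (GNU syntax shown; the run ID below comes
      from the example earlier in this thread):

      ```shell
      # A run ID is "<unix-timestamp>-<suffix>"; decode the timestamp part.
      run_id="1690106684-1860"
      ts="${run_id%%-*}"                  # strip everything after the first dash
      date -u -d "@$ts" +%Y%m%d-%H%M%S    # 20230723-100444 (UTC)
      ```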

      - It would be great to be able to replay only the errors of a job, so
      the process would be even faster. As errors are all logged in
      log/<run_id>/*.stderr, I was wondering if new lists could be generated
      from there

      This is probably already in my TODO file as "Ability to replay one or
      more jobs within a specific run". If we can identify the jobs to replay,
      then we're all good. Maybe this could be displayed in "fpsync -l" output
      too, when a run has had errors?
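      As a hypothetical sketch of that idea (the error-line format below is an
      assumption based on typical rsync messages, not fpsync's documented
      output), a retry list could be extracted from the non-empty stderr logs:

      ```shell
      # Mock stderr logs: rsync-style error lines usually quote the failing path,
      # e.g.: rsync: send_files failed to open "/src/a/file1": Permission denied (13)
      errs=/tmp/fpsync_retry_demo
      mkdir -p "$errs"
      printf 'rsync: send_files failed to open "/src/a/file1": Permission denied (13)\n' \
          > "$errs/part.0.stderr"
      touch "$errs/part.1.stderr"   # empty: this part had no errors

      # Collect quoted paths from non-empty stderr files into a retry list,
      # which could then feed a new (much smaller) run.
      find "$errs" -name '*.stderr' -size +0c -exec cat {} + \
          | sed -n 's/.*"\(\/[^"]*\)".*/\1/p' > "$errs/retry.txt"

      cat "$errs/retry.txt"   # /src/a/file1
      ```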

      Thanks for those ideas, they will help improve fpsync!

      Best regards,

      --
      Ganael LAPLANCHE ganael.laplanche@martymac.org
      http://www.martymac.org | http://contribs.martymac.org
      FreeBSD: martymac martymac@FreeBSD.org, http://www.FreeBSD.org

       
