
fpsync parts number and intermediate runs

Fab11
2023-07-19
2023-10-26
  • Fab11

    Fab11 - 2023-07-19

    Hello,

    I have to migrate a large amount of data (around 100 TB) to a new storage array. The data is mostly composed of small files (100 KB max) located in a directory structure seven levels deep. The transfer to the new array has to go through an SMB/CIFS share. So, Samba and small files: the worst-case scenario...
    Using rsync or robocopy would take ages. I discovered fpart and fpsync by accident, and according to my speed tests they turned out to be the magic solution: they would let me copy this data in two weeks instead of six months.

    I read the article "Parallélisez vos transferts de fichiers" ("Parallelize your file transfers"; I'm French) with great interest, but I need some advice on two subjects:

    1) To avoid waiting for big lists to be generated by fpart before starting rsync, I set fpsync to use parts of 1,000 files (-f). But I realized the number of parts is huge (45,000+). I should mention that the data is spread across several 8 TB volumes, processed one at a time. The server running the jobs has 48 cores and fpsync is set up to run 24 parallel rsync processes (-n).
    Is this a mistake? Should I use bigger parts, for instance 10,000 files each, to reduce the total number of parts?

    2) I didn't find, either in the article or on the web, information about intermediate synchronizations (the "middle" ones, before the final one). How should these be handled? Just by running the same fpsync command again? Re-running an existing job would skip part generation, but it would not pick up new files in the source, so restarting from scratch seems mandatory. But that also means it will take almost as long as the first pass? In my case the files are so small I doubt rsync'ing them again will make much difference.
    Another solution could be to retrieve all the parts of a job and merge them into one or a few files to pass to rsync, so as to run a few syncs instead of so many. But again, new files would not be taken into account.
    In my opinion, the documentation about restarting/replaying jobs lacks details and examples.

    How would you approach this?

    Thanks for your help and advice.

     
    • Ganael Laplanche

      On Wednesday, July 19, 2023 8:52:32 PM CEST Fab11 wrote:

      Hello,

      Sorry for my delayed answer, I am just back from holidays :p

      [...]
      1) To avoid waiting for big lists to be generated by fpart before starting
      rsync, I set fpsync to use parts of 1,000 files (-f). But I realized the
      number of parts is huge (45,000+). I should mention that the data is spread
      across several 8 TB volumes, processed one at a time. The server running
      the jobs has 48 cores and fpsync is set up to run 24 parallel rsync
      processes (-n).
      Is this a mistake? Should I use bigger parts, for instance 10,000 files
      each, to reduce the total number of parts?

      Yes, I would probably try to generate fewer (bigger) partitions, but there
      is no easy way to compute the ideal number:

      1) If you generate too many (small) partitions, you will lose time forking
      very small rsync processes.

      2) Fpsync is able to start transfers as soon as a single partition has been
      generated; it generates the next ones during that transfer. If you start 24
      parallel rsync jobs, you probably want those first 24 partitions to be
      generated as fast as possible, so you don't want them to be too big.

      You have to find a good balance here. In any case, 1,000 files definitely
      seems too small (IMHO).
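      As a rough illustration of that balance (all numbers below are
      illustrative assumptions loosely based on this thread's figures, not a
      recommendation), the part count and the number of worker "rounds" can be
      estimated like this:

      ```shell
      # Back-of-the-envelope sizing: how many parts and scheduling rounds a run
      # would produce. All numbers are illustrative assumptions.
      files_total=45000000   # ~45,000 parts x 1,000 files from the first attempt
      workers=24             # parallel rsync processes (fpsync -n 24)
      part_size=100000       # files per part (fpsync -f)

      parts=$(( (files_total + part_size - 1) / part_size ))   # ceiling division
      rounds=$(( (parts + workers - 1) / workers ))
      echo "parts=$parts rounds=$rounds"   # parts=450 rounds=19
      ```

      Bigger parts shrink the total part count (and the fork overhead), at the
      cost of a longer wait before the first 24 partitions are ready.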

      2) I didn't find, either in the article or on the web, information about
      intermediate synchronizations (the "middle" ones, before the final one).
      How should these be handled? Just by running the same fpsync command
      again? Re-running an existing job would skip part generation, but it
      would not pick up new files in the source, so restarting from scratch
      seems mandatory. But that also means it will take almost as long as the
      first pass?

      Yes, re-running fpsync the same way will take the same time as the first pass.

      To avoid crawling time, you can use fpsync's replay feature (-R) but, as you
      say, it will only update known files' contents. If your files only change in
      content, this can be a good solution. If they are frequently deleted and
      replaced by other ones (i.e. names change), it would not work, as fpsync
      would skip most of them. There is no easy solution here; the only way to get
      new file names is to crawl the filesystem again (re-run fpsync from scratch).

      For the final pass, you can have a look at fpsync's -E option, which makes
      it work on a directory basis and enables rsync's --delete option. But if
      you have very few directories it will not work very well (it will not be
      able to produce enough partitions).

      Another solution could be to retrieve all the parts
      of a job and merge them into one or a few files to pass to rsync, so as
      to run a few syncs instead of so many. But again, new files would not be
      taken into account.

      That would just replay the synchronization. If that's what you want, you
      probably want to use the -E option; it will be easier to handle.
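      For reference, the manual merge described above could be sketched like
      this (the directory layout and file names below are made up for the demo;
      fpart part files are plain lists of paths, one per line):

      ```shell
      # Demo layout: two fake part files, one path per line (made-up names).
      demo=/tmp/fpsync_merge_demo
      mkdir -p "$demo"
      printf 'dir1/file1\ndir1/file2\n' > "$demo/part.0"
      printf 'dir2/file3\n'             > "$demo/part.1"

      # Merge all parts into a single list...
      cat "$demo"/part.* > "$demo/all_files.txt"

      # ...which could then feed a single rsync instead of one rsync per part:
      #   rsync -a --files-from="$demo/all_files.txt" /src/ /dst/
      wc -l < "$demo/all_files.txt"   # 3
      ```

      Note that this inherits the limitation discussed above: files created
      after the parts were generated will not appear in the merged list.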

      In my opinion, the documentation about restarting/replaying jobs lacks
      details and examples.
      How would you approach this?

      Thanks for that feedback, I'll add that to my TODO list.

      Here is a small example:

      $ fpsync -l
      <=== Listing runs

      Nothing has been run here.

      $ fpsync -n 2 /usr/src/ /var/tmp/src/

      That command starts a first run...

      $ fpsync -l
      <=== Listing runs
      ===> Run: ID: 1690106684-1860, status: replayable (synchronization complete,
      use -R to replay)

      ...which becomes replayable afterwards, with this command:

      $ fpsync -R -r 1690106684-1860

      You can prepare a run (i.e. generate jobs without running the rsync
      commands) by adding -p to the initial command:

      $ fpsync -p -n 2 /usr/src/ /var/tmp/src/
      1690106994 <=== Successfully prepared run: 1690106992-4040
      $ fpsync -l
      <=== Listing runs
      ===> Run: ID: 1690106684-1860, status: replayable (synchronization complete,
      use -R to replay)
      ===> Run: ID: 1690106992-4040, status: resumable (synchronization not
      complete, use -r to resume)

      It then becomes resumable and can be started this way:

      $ fpsync -r 1690106992-4040

      A side note: you probably want to use the current git version of fpsync,
      as a bug regarding resume/replay has been fixed:

      https://github.com/martymac/fpart/commit/be14d1c172daca70a2502a231e75d72f9e398265

      Hope this helps,
      Best regards,
      (and thanks for your interest in fpart/fpsync !)

      --
      Ganael LAPLANCHE ganael.laplanche@martymac.org
      http://www.martymac.org | http://contribs.martymac.org
      FreeBSD: martymac martymac@FreeBSD.org, http://www.FreeBSD.org

       
  • Fab11

    Fab11 - 2023-07-25

    Hello,

    Many thanks for your detailed answer!

    You confirmed that 1,000 files is not enough; that's my opinion too. I realized it while frequently looking at top's output: the part files being passed to rsync (visible on the command lines) were way behind the last part generated, meaning fpart was generating parts faster than rsync could process them. I will maybe try to generate bigger parts for intermediate runs.

    Another question regarding jobs (thanks for the details on their usage). I was wondering which would be the most efficient, or the fastest: a replay (which skips filesystem crawling since all the part files have already been generated), or a single rsync? The difference would be passing a single file list to rsync (created by cat'ing all the part files together), whereas I guess fpsync reads all its parts and runs as many rsyncs as there are parts.
    Would that make sense?

    I would also like to ask you about log files (not only the logs, but everything related to a job in /tmp/fpsync by default), which turned out to be really big in my case (more than 20 GB for an 8 TB job). Aside from disk space, there is also a quirk: many files are generated. For each part, I saw at least three files: the part itself, a stdout log file and a stderr log file. As the last two are in the same directory, I ended up with slow directory listings. And since most, if not all, stderr files are empty when everything works well, they are somewhat useless.
    Would it make sense to add an option to "clean" the log directory of empty files at the end of a job, or even at the end of each rsync before switching to the next one?

    Thanks again.

    Best regards.

     
    • Ganael Laplanche

      On Tuesday, July 25, 2023 3:52:17 PM CEST Fab11 wrote:

      Hello,

      Many thanks for your detailed answer!

      You're welcome :)

      You confirmed that 1,000 files is not enough; that's my opinion too. I
      realized it while frequently looking at top's output: the part files
      being passed to rsync were way behind the last part generated, meaning
      fpart was generating parts faster than rsync could process them. I will
      maybe try to generate bigger parts for intermediate runs.

      Yes, that's the way to go, for sure.

      Another question regarding jobs (thanks for the details on their usage).
      I was wondering which would be the most efficient, or the fastest: a
      replay (which skips filesystem crawling since all the part files have
      already been generated), or a single rsync? The difference would be
      passing a single file list to rsync (created by cat'ing all the part
      files together), whereas I guess fpsync reads all its parts and runs as
      many rsyncs as there are parts. Would that make sense?

      I was suggesting the replay feature to avoid the burden of concatenating
      all the partitions, but if you don't mind handling that by hand, it could
      be worth testing. A single rsync might be fast too, as rsync would
      probably spend most of its time comparing file metadata; to be honest,
      this has to be tested to give a proper answer.

      I would also like to ask you about log files (not only the logs, but
      everything related to a job in /tmp/fpsync by default), which turned out
      to be really big in my case (more than 20 GB for an 8 TB job). Aside from
      disk space, there is also a quirk: many files are generated. For each
      part, I saw at least three files: the part itself, a stdout log file and
      a stderr log file. As the last two are in the same directory, I ended up
      with slow directory listings. And since most, if not all, stderr files
      are empty when everything works well, they are somewhat useless. Would it
      make sense to add an option to "clean" the log directory of empty files
      at the end of a job, or even at the end of each rsync before switching to
      the next one?

      Yes, that is an interesting idea; it could even be done by default, as
      empty files are probably useless. I can't look at it right now but I'll
      add it to my TODO list, thanks for the idea!
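      Pending such an option, the cleanup can already be done by hand with
      find(1); a minimal sketch on a mock log directory (file names are made
      up for the demo):

      ```shell
      # Mock log directory with one empty and one non-empty stderr file.
      logs=/tmp/fpsync_logs_demo
      mkdir -p "$logs"
      touch "$logs/part.0.stderr"                    # empty: safe to remove
      printf 'some error\n' > "$logs/part.1.stderr"  # non-empty: keep it

      # Delete only the empty *.stderr files, keeping real error logs.
      find "$logs" -type f -name '*.stderr' -empty -delete

      ls "$logs"   # only part.1.stderr remains
      ```

      Running this at the end of a job keeps the error logs that matter while
      avoiding the huge directory listings mentioned above.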

      Best regards,

      --
      Ganael LAPLANCHE ganael.laplanche@martymac.org
      http://www.martymac.org | http://contribs.martymac.org
      FreeBSD: martymac martymac@FreeBSD.org, http://www.FreeBSD.org

       
  • Fab11

    Fab11 - 2023-07-28

    Hello,

    I will try that and let you know.
    You are welcome for the ideas :) Thanks for your consideration.

    Best regards.

     
  • Fab11

    Fab11 - 2023-10-24

    Hello,

    Sorry for replying so late.

    I increased the part size from 1,000 to 10,000 files, but parts were still generated way faster than they were processed. I then moved to 100,000, which seems to fit better. In the worst case, the last part was generated just an hour before the end of the sync (half an hour before, on average), for a job duration of 7 to 9 hours.

    It is difficult to say whether it was quicker, and by how much, because it was a second pass anyway: a big part of the time rsync spent copying files during the first pass was not spent again during the second.

    To give an overview, the jobs were pretty well balanced. Each job generated around 1,900 parts of 100,000 files, lasted between 7 and 9 hours as said before, and generated around 10 GB of part files. I was still using 24 parallel rsync processes, on the same volumes (8 TB each).

    I couldn't compare an fpsync replay job against a single rsync using a single file list. In fact, I was able to generate lists of only the files modified in the source (from the application using the data), so I ended up with an optimized list.

    So, finding the right settings is not easy... But the tool is very powerful and hugely speeds up such a copy process compared to the usual tools (cp, mv, rsync, robocopy, ...), so the big added value is there anyway.

    I have other suggestions regarding log files and jobs:
    - It would be nice to have an option to replace the job ID (run ID), which looks like a timestamp, with its human-readable form YYYYDDMM-HHMMSS (or similar), or even with a custom name
    - fpsync -l could also show a status to quickly identify whether there were errors during the run. Maybe this could be retrieved from the last lines of log/<run_id>/fpsync.log, which outputs such a status?
    - It would be great to be able to replay only the errors of a job, so the process would be even faster. As errors are all logged in log/<run_id>/*.stderr, I was wondering if new lists could be generated from there

    What do you think about all this?

    Best regards.

     
    • Ganael Laplanche

      On 10/24/23 18:57, Fab11 wrote:

      Hello there,

      I increased the part size from 1,000 to 10,000 files, but parts were
      still generated way faster than they were processed. I then moved to
      100,000, which seems to fit better. In the worst case, the last part was
      generated just an hour before the end of the sync (half an hour before,
      on average), for a job duration of 7 to 9 hours.

      It is difficult to say whether it was quicker, and by how much, because
      it was a second pass anyway: a big part of the time rsync spent copying
      files during the first pass was not spent again during the second.

      To give an overview, the jobs were pretty well balanced. Each job
      generated around 1,900 parts of 100,000 files, lasted between 7 and 9
      hours as said before, and generated around 10 GB of part files. I was
      still using 24 parallel rsync processes, on the same volumes (8 TB each).

      I couldn't compare an fpsync replay job against a single rsync using a
      single file list. In fact, I was able to generate lists of only the
      files modified in the source (from the application using the data), so I
      ended up with an optimized list.

      So, finding the right settings is not easy... But the tool is very
      powerful and hugely speeds up such a copy process compared to the usual
      tools (cp, mv, rsync, robocopy, ...), so the big added value is there
      anyway.

      Yes, finding the right settings is not easy... Testing and adapting
      values like you did is probably the (only) way to go.

      Anyway, good to read that, and thanks for your detailed feedback :)

      I have other suggestions regarding log files and jobs:
      - It would be nice to have an option to replace the job ID (run ID),
      which looks like a timestamp, with its human-readable form
      YYYYDDMM-HHMMSS (or similar), or even with a custom name
      - fpsync -l could also show a status to quickly identify whether there
      were errors during the run. Maybe this could be retrieved from the last
      lines of log/<run_id>/fpsync.log, which outputs such a status?

      Those are good ideas. I'll add that to my TODO file, thanks.
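      Pending such an option, the timestamp part of a run ID can already be
      decoded by hand with date(1) (GNU syntax shown; the run ID below comes
      from the example earlier in this thread):

      ```shell
      # A run ID is "<unix-timestamp>-<suffix>"; decode the timestamp part.
      run_id="1690106684-1860"
      ts="${run_id%%-*}"                  # strip everything after the first dash
      date -u -d "@$ts" +%Y%m%d-%H%M%S    # 20230723-100444 (UTC)
      ```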

      - It would be great to be able to replay only the errors of a job, so
      the process would be even faster. As errors are all logged in
      log/<run_id>/*.stderr, I was wondering if new lists could be generated
      from there

      This is probably already in my TODO file as "Ability to replay one or
      more jobs within a specific run". If we can identify the jobs to replay,
      then we're all good. Maybe this could be displayed in "fpsync -l" output
      too, when a run has had errors?
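      As a hypothetical sketch of that idea (the error-line format below is an
      assumption based on typical rsync messages, not fpsync's documented
      output), a retry list could be extracted from the non-empty stderr logs:

      ```shell
      # Mock stderr logs: rsync-style error lines usually quote the failing path,
      # e.g.: rsync: send_files failed to open "/src/a/file1": Permission denied (13)
      errs=/tmp/fpsync_retry_demo
      mkdir -p "$errs"
      printf 'rsync: send_files failed to open "/src/a/file1": Permission denied (13)\n' \
          > "$errs/part.0.stderr"
      touch "$errs/part.1.stderr"   # empty: this part had no errors

      # Collect quoted paths from non-empty stderr files into a retry list,
      # which could then feed a new (much smaller) run.
      find "$errs" -name '*.stderr' -size +0c -exec cat {} + \
          | sed -n 's/.*"\(\/[^"]*\)".*/\1/p' > "$errs/retry.txt"

      cat "$errs/retry.txt"   # /src/a/file1
      ```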

      Thanks for those ideas, they will help improve fpsync!

      Best regards,

      --
      Ganael LAPLANCHE ganael.laplanche@martymac.org
      http://www.martymac.org | http://contribs.martymac.org
      FreeBSD: martymac martymac@FreeBSD.org, http://www.FreeBSD.org

       
