126 Commits

Author SHA1 Message Date
Günther Nußmüller
d77765a23e Fix UnboundLocalError with plugins and largefiles
When plugins are used in a repository that contains largefiles,
the following exception is thrown as soon as the first largefile
is converted:

```
Traceback (most recent call last):
  File "fast-export/hg-fast-export.py", line 728, in <module>
    sys.exit(hg2git(options.repourl,m,options.marksfile,options.mappingfile,
  File "fast-export/hg-fast-export.py", line 581, in hg2git
    c=export_commit(ui,repo,rev,old_marks,max,c,authors,branchesmap,
  File "fast-export/hg-fast-export.py", line 366, in export_commit
    export_file_contents(ctx,man,modified,hgtags,fn_encoding,plugins)
  File "fast-export/hg-fast-export.py", line 222, in export_file_contents
    file_data = {'filename':filename,'file_ctx':file_ctx,'data':d}
UnboundLocalError: local variable 'file_ctx' referenced before assignment
```

This commit fixes the error by:

 * Initializing `file_ctx` before the largefile handling takes place
 * Providing a new `is_largefile` value for plugins so they can detect
   whether largefile handling was applied (in which case the `file_ctx`
   object may no longer be in sync with the git version of the file)
2025-08-11 08:30:17 +02:00
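The failure mode fixed by d77765a23e can be reproduced in miniature. The sketch below is not the fast-export code: `export_buggy`, `export_fixed`, and the `ctx` and `largefiles` dictionaries are hypothetical stand-ins, and placing `is_largefile` in the same dictionary as `file_ctx` is an assumption. Only the names `file_ctx` and `is_largefile` come from the commit message.

```python
def export_buggy(filename, ctx, largefiles):
    """Mimics the bug: file_ctx is bound on only one branch."""
    if filename in largefiles:
        data = largefiles[filename]
    else:
        file_ctx = ctx[filename]
        data = file_ctx
    # Raises UnboundLocalError when the largefile branch was taken.
    return {'filename': filename, 'file_ctx': file_ctx, 'data': data}


def export_fixed(filename, ctx, largefiles):
    """Binds file_ctx first and tells plugins whether largefile handling ran."""
    file_ctx = ctx[filename]
    is_largefile = filename in largefiles
    data = largefiles[filename] if is_largefile else file_ctx
    return {'filename': filename, 'file_ctx': file_ctx, 'data': data,
            'is_largefile': is_largefile}
```

A plugin that inspects `file_ctx` could check `is_largefile` first and skip any comparison against `data` when it is true.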
Frej Drejhammar
f71385ec14 Fix "Warn if one of the marks, mapping, or heads files are empty"
The commit "Warn if one of the marks, mapping, or heads files are
empty" (7224e420a7) mixed up the state and heads caches and reported
that the heads cache was empty if the state cache was. Error found by
Shun-ichi Goto.

Closes #338
2025-06-05 16:50:56 +02:00
Frank Zingsheim
bd707b5d6e Fix: Largefiles ignored #141
Import mercurial large files as ordinary files into git

The basic idea of this fix is based on
https://github.com/planestraveler/fast-export/tree/add-lfs-support-v2
from PR #65

Closes #141
2025-03-29 18:39:27 +01:00
Thalia Archibald
f947189dcc Consistently terminate commit messages with LF
When the length logic for fast-import 'data' commands was updated in
4c10270 (Fix data handling, 2023-03-02), one branch was missed, so
commit messages now do not have a final LF appended in most cases. This
changed the longtime behavior, which had been consistent since the first
commit of hg2git, 9832035 (Initial import, 2007-03-06), and is expected
by some applications which compare against old conversions from
Mercurial.
2024-07-05 05:20:35 -07:00
Frej Drejhammar
fb225c4700 Merge branch 'gh/321' 2024-02-23 17:07:02 +01:00
Stephan Hohe
e63feee1b9 Don't add file if plugin sets content to None 2024-02-20 17:07:23 +01:00
Stephan Hohe
7b4bb7ff1d Fix escape in regular expression 2024-02-19 23:40:05 +01:00
Frej Drejhammar
ddfc3a8300 Run file_data_filter on deleted files
The `file_data_filter` method should be called when files are deleted.
In this case the `data` and `file_ctx` keys map to None. This is so
that a filter which modifies file names can apply the same name
transformations before files are deleted.
2024-02-16 17:12:49 +01:00
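A renaming filter that copes with the deletion case described in ddfc3a8300 might look like the sketch below. The `Filter` class, its constructor argument, and the line-ending normalization are hypothetical. The `file_data_filter` method and the `filename`, `data`, and `file_ctx` keys are taken from the commit messages in this log.

```python
class Filter:
    """Hypothetical plugin that moves every file under a prefix directory."""

    def __init__(self, prefix):
        self.prefix = prefix

    def file_data_filter(self, file_data):
        # Rename unconditionally so deletions target the renamed path too.
        file_data['filename'] = self.prefix + b'/' + file_data['filename']
        # For deleted files 'data' and 'file_ctx' are None: skip content work.
        if file_data['data'] is not None:
            file_data['data'] = file_data['data'].replace(b'\r\n', b'\n')
```

Applying the rename before the `None` check is the point: the deletion is then recorded against the same path under which the file was added.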
Ekin Dursun
c49dd0cf60 Remove Python 2 compatibility code
Python 2 support was removed recently, so we don't need the
compatibility code anymore.
2023-11-18 20:22:18 +03:00
Felipe Contreras
9754a9f3f6 Trivial simplification
Just return the values directly, no need to store them into variables.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
d2f11bd619 Remove multiple parent logic for file changes
This is already what repo.status does.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
3582221efd Compare changes only with the first parent
It's not necessary to check both parents.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
0ae0d20496 Remove no-op check
This code is only executed when there are two parents.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
e09a14a266 Move parents logic inside get_filechanges
This way export_commit is much simpler (already quite complex), and it's
easier to modify the logic.

No functional changes.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
9df2f97f6c Rename variables in get_filechanges
It's easier to understand this way.

No functional changes.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
531fa9b3a2 Simplify split_dict
There's no need to keep track of the left side: if it's modified it's
modified.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
a229b39d66 Coalesce modified files
Git doesn't care if they are added or changed: they are modified.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
c666fd9c95 Trivial style cleanup
Checking the array directly is more idiomatic.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
21fa443b4a Simplify list of files for the first commit
We already have the files.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-14 22:12:50 -06:00
Felipe Contreras
6fbe4d0ad0 Skip earlier
Now that we have ctx easily available, skip early.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-10 12:38:42 -06:00
Felipe Contreras
fa73d8dec9 Share the changectx more
It's used everywhere, might as well pass it along.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-10 12:38:30 -06:00
Felipe Contreras
e1e15b2091 Avoid revsymbol()
We can just do repo[rev].

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-09 19:48:44 -06:00
Felipe Contreras
534d2bdd92 Don't deal with the node in get_changeset()
It's not necessary.

It could be fetched with repo[rev].node(), but why bother?

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-09 19:48:44 -06:00
Felipe Contreras
23f41c0ff1 Use revision directly instead of revnode
We don't need the revnode.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-09 19:48:44 -06:00
Felipe Contreras
8b1fd408ca Use changectx directly
There's no need to call repo[revnode] when repo[rev] works perfectly
fine.

And since we have the context already we can just do ctx.hex() instead
of hexlifying ourselves.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-09 19:48:44 -06:00
Felipe Contreras
4a4d242e98 Fetch node directly
No need to call get_changeset() for that.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-09 19:48:44 -06:00
Felipe Contreras
432254100b Fetch branch names directly
No need to use get_changeset() for just one thing.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-09 19:48:44 -06:00
Felipe Contreras
5e4bc6eb03 Remove cruft
Nothing uses that variable.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-09 19:48:44 -06:00
Felipe Contreras
bbab981130 Trivial simplification of wr
No need to issue two write commands.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-04 16:08:45 +01:00
Felipe Contreras
c3cbf1e04d Add wr_data helper
No functional changes.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-03 19:34:29 -06:00
Felipe Contreras
4c10270302 Fix data handling
The length should be exactly the same as the data, for example if the
data is "hello" only 5 characters should be written on the stream. Thus
it should always be `len(data)`, not `len(data)+1` as it currently is in
some places.

Since the first commit of hg2git.py there has been a wtf comment;
presumably Rocco was confused about this common discrepancy.

We can shuffle the logic around by adding '\n' to the data, and removing
+1 to the length.

Also, the data should be written without a newline (wr_no_nl).

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
2023-03-03 19:33:45 -06:00
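The length rule from 4c10270302 can be illustrated with a small helper. This is a sketch, not the project's `wr_data`: the explicit `out` stream parameter is an assumption made so the example is self-contained.

```python
import io


def wr_data(out, data):
    """Write a fast-import 'data' command whose count equals len(data)."""
    out.write(b'data %d\n' % len(data))
    out.write(data)  # raw payload, no newline added by the writer


buf = io.BytesIO()
wr_data(buf, b'hello')

# A commit message that should end in LF carries the LF in the payload,
# so the LF is counted in the length instead of patched in with "+1".
msg_buf = io.BytesIO()
wr_data(msg_buf, b'hello' + b'\n')
```

With this shape, `buf` holds `data 5` followed by the five payload bytes, and `msg_buf` holds `data 6` followed by the six bytes including the trailing LF.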
chrisjbillington
13c273f10c Resolve unicode escape sequences not being processed correctly
In `process_unicode_escape_sequences()`, any backslash escape sequences
in the original string are escaped upon the first
`.encode('unicode-escape')` and therefore round-trip the sequence of
`.encode('unicode-escape').decode('unicode-escape')`.

That is not what we want - we want these sequences to be passed through
the `.encode` unchanged, so that they will be converted to the
character they represent upon `.decode()`.

This patch changes the `.encode()` step to pass through any ascii
characters unchanged, only escaping non-ascii characters. This ensures
any existing backslash escape sequences will be interpreted as the
character they represent upon `.decode()`.
2022-10-23 11:51:33 +11:00
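The difference described in 13c273f10c can be seen by comparing the two encode steps side by side. The function names below are hypothetical, and using `encode('ascii', 'backslashreplace')` is one way to escape only non-ascii characters; the actual patch may differ.

```python
def escapes_round_trip(s):
    """Old behavior: existing backslash sequences get escaped again."""
    return (s.decode('utf8')
             .encode('unicode-escape')
             .decode('unicode-escape')
             .encode('utf8'))


def escapes_processed(s):
    """Escape only non-ascii characters, so existing sequences are decoded."""
    return (s.decode('utf8')
             .encode('ascii', 'backslashreplace')
             .decode('unicode-escape')
             .encode('utf8'))


# UTF-8 bytes for 'é' followed by a literal backslash and 't'.
raw = 'é\\t'.encode('utf8')
```

Here `escapes_round_trip(raw)` returns `raw` unchanged, while `escapes_processed(raw)` turns the backslash sequence into a real tab and keeps the `é` intact.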
Frej Drejhammar
f179afce65 Fix FutureWarning about nested sets in re
Since Python 3.7 the re module warns for syntax which could, in the
future, be misparsed as a nested set. Avoid this by escaping the
literal `[` we search for in the regexp.

Reported by Monte Davidoff @mndavidoff

Closes #269.
2022-02-09 15:37:29 +01:00
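The log does not show the regexp that f179afce65 changed, so the patterns below are hypothetical. The sketch compiles a character set that begins with a bare `[` and the same set with the bracket escaped, recording any warnings raised during compilation.

```python
import re
import warnings

# Hypothetical patterns: the regexp changed by the commit is not shown here.
UNESCAPED = r'[[<]'   # character set whose first member is a bare '['
ESCAPED = r'[\[<]'    # same set with the bracket escaped


def compile_recording_warnings(pattern):
    """Compile pattern and return (regex, list of warning categories)."""
    re.purge()  # drop the compile cache so the pattern is parsed again
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        regex = re.compile(pattern)
    return regex, [w.category for w in caught]
```

Both patterns match the same characters. On Python 3.7 or later, compiling `UNESCAPED` is expected to record a `FutureWarning`, while `ESCAPED` compiles silently.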
Frej Drejhammar
5b7ca5aaec Give proper error message when refusing to overwrite existing branch
If fast-export was asked to export a Mercurial branch to Git and a
branch of the same name already existed in the Git repo but it was not
created by fast export, fast-export would crash while trying to format
an error message claiming that the destination branch was modified
behind its back.

This patch extends fast-export to detect the situation above and give
a proper error message which hopefully is less confusing to the user.

Credit for discovering the original crash goes to Shun-ichi Goto
<gotoh@taiyo.co.jp>.

Closes: #269.
2021-08-27 16:04:40 +02:00
Frej Drejhammar
bdfc0c08c7 Merge branch 'frej/issue-258'
Closes #258
2021-02-26 16:44:31 +01:00
SirIntellegence
20c22a3110 Add plugin support for the 'extra' field
Permits plugins to import other information such as svn conversion revisions
2021-02-22 13:09:48 -07:00
Frej Drejhammar
f741bf39f2 bugfix: Avoid starting incremental conversions from scratch
Keys and values in the state cache are byte strings, therefore a
lookup of 'tip' will always fail. The failure makes the conversion
start over from the beginning, but as fast-export is deterministic the
results are the same, just very inefficient. The bug has existed since
the port to Python 3.

This patch switches the 'tip' lookup to use a byte string which should
make incremental conversions restart at the last converted commit. As
'x' == b'x' in Python 2, this should be a backwards compatible change.

Bug reported and fix suggested by Tomas Kolda.

Fixes #258.
2021-02-19 16:47:53 +01:00
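The lookup failure behind f741bf39f2 is easy to reproduce on Python 3. In this sketch the cache contents and the `last_converted` helper are hypothetical; only the `tip` key and the fact that keys and values are byte strings come from the commit message.

```python
def last_converted(state_cache, key):
    """Return the revision stored under key, or 0 to start from scratch."""
    return int(state_cache.get(key, 0))


# Keys and values are byte strings, as described in the commit message.
state_cache = {b'tip': b'41'}

restart_from_str = last_converted(state_cache, 'tip')     # str key: miss
restart_from_bytes = last_converted(state_cache, b'tip')  # bytes key: hit
```

On Python 3, `'tip'` and `b'tip'` are different keys, so the str lookup silently falls back to 0 and the conversion restarts from the first revision.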
Frej Drejhammar
7057ce2c2b Allow plugins to modify the committer
Since their introduction, plugins have been able to modify the author
of a commit, but not the committer. This patch adds the necessary
support for allowing them to also modify the committer.
2020-09-30 17:47:33 +02:00
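A commit plugin using the support added in 7057ce2c2b might look like the following sketch. The `Filter` class, the `commit_message_filter` method name, and the `committer` and `author` dictionary keys are assumptions about the plugin interface, which this log does not spell out.

```python
class Filter:
    """Hypothetical commit plugin that rewrites the committer."""

    def __init__(self, committer):
        self.committer = committer

    def commit_message_filter(self, commit_data):
        # Keep the original author, replace only the committer.
        commit_data['committer'] = self.committer
```

The author entry is left untouched, so the converted history still credits the original author while the committer reflects whoever ran the conversion.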
Ondrej Stanek
9c6dea9fd4 Pass original hg commit hash to plugins 2020-07-31 10:50:51 +02:00
Ethan Furman
5c1cbf82b0 Add revision to commit_data for commit plugins
Co-Authored-By: ostan89@gmail.com
2020-07-31 10:48:33 +02:00
Ondrej Stanek
50631c4b34 Add option --ignore-unnamed-heads
This option allows the user to ignore only unnamed heads (compared to --force
which ignores all non-fatal issues). The intended use is for a future plugin
converting unnamed heads to named branches.
2020-07-31 10:30:53 +02:00
Ethan Furman
2a9dd53d14 Show all unnamed heads at once
Co-Authored-By: ostan89@gmail.com
2020-07-31 10:27:07 +02:00
chrisjbillington
d29d30363b Fix backward incompatible change for hg < 5.1
The port to Python 3 in b961f146 changed `repo.branchmap().iteritems()`
to use `.items()` instead. However, the object returned by mercurial
isn't a dictionary and its `.items()` method was only introduced (as an
alias for `iteritems`) in hg 5.1. `iteritems()` still exists, so let's
keep using it for now to retain compatibility with hg < 5.1.
2020-05-06 11:59:49 -04:00
Frej Drejhammar
f102d2a69f Merge branch 'PR/223'
Closes #223
2020-05-06 16:31:13 +02:00
Ondrej Stanek
cf0e5837b6 Allow converting a repository with git and hg subrepos
In the verification phase, fast-export wrongly expects both hg and git
subrepositories to have a corresponding line in the subrepo-map file.
In fact, only hg subrepos need a line in subrepo-map that references a
converted subrepo; git subrepositories do not.
2020-05-06 16:30:05 +02:00
chrisjbillington
3b3f86b71e Allow utf8 in mappings
We were previously processing entries in mapping files (when
`--mappings-are-raw` is not given) with
`.decode('unicode_escape').encode('utf8')` to replace backslash escape
sequences in bytestrings with the utf-8 encoded characters they
represent. However, it turns out that `.decode('unicode_escape')`
assumes latin-1 encoding if it encounters non-ascii
bytes: https://bugs.python.org/issue21331. So this gave incorrect
results if non-ascii utf8 data was present in the mapping.

To fix this, we now add an extra layer of
`.decode('utf8').encode('unicode-escape')` in order to convert any
non-ascii characters into
their backslash escape sequences. Then the subsequent
`.decode('unicode_escape')` only encounters ascii characters and gives
correct results.
2020-03-25 12:33:42 -04:00
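The latin-1 assumption described in 3b3f86b71e can be demonstrated directly. The function names are hypothetical; the call chains are the ones quoted in the commit message.

```python
def naive(entry):
    """Original processing: non-ascii bytes are misread as latin-1."""
    return entry.decode('unicode_escape').encode('utf8')


def with_extra_layer(entry):
    """Escape non-ascii characters first so only ascii reaches the decode."""
    return (entry.decode('utf8')
                 .encode('unicode-escape')
                 .decode('unicode_escape')
                 .encode('utf8'))


entry = 'é'.encode('utf8')  # two bytes: 0xC3 0xA9
```

`naive(entry)` treats each of the two bytes as its own latin-1 character and re-encodes them as four bytes, whereas `with_extra_layer(entry)` returns the original two bytes.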
chrisjbillington
6361b44c33 Fix bug in ignoring .git files/folders on Windows
Mercurial internally stores (most) filepaths using forward slashes, and
returns them as such from its Python API, even on Windows.

So the splitting up of filepaths with `os.path.sep` was incorrect,
resulting in `.git` files (those within a subdirectory, anyway)
not being ignored on Windows as intended. Splitting on `b'/'` regardless
of OS fixes this.
2020-03-08 19:40:50 +01:00
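The separator problem from 6361b44c33 comes down to which byte the path is split on. In this sketch `has_git_component` is a hypothetical helper, and `ntpath.sep` stands in for the value `os.path.sep` has on Windows.

```python
import ntpath


def has_git_component(path, sep=b'/'):
    """Return True if any component of the bytes path is exactly b'.git'."""
    return b'.git' in path.split(sep)


# Mercurial reports paths with forward slashes, even on Windows.
hg_path = b'sub/.git/config'

found_with_slash = has_git_component(hg_path)
found_with_windows_sep = has_git_component(hg_path, ntpath.sep.encode())
```

Splitting on the Windows separator leaves the path in one piece, so the `.git` component inside a subdirectory goes unnoticed; splitting on `b'/'` finds it on every platform.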
chrisjbillington
48508ee299 Fix failure to print error message in verify_heads
On Python 3, `b'%s' % None` fails with a TypeError. In verify_heads,
an error message prints the sha1 of a git commit, but that sha1
can be None.

This commit instead prints `b'<None>'` if sha1 is None.
2020-03-06 11:02:38 -05:00
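The `TypeError` mentioned in 48508ee299 and the `b'<None>'` fallback can be sketched as follows. The `fmt_sha1` helper and the message text are hypothetical.

```python
def fmt_sha1(sha1):
    """Return sha1 unchanged, or a bytes placeholder when it is None."""
    return sha1 if sha1 is not None else b'<None>'


def head_message(sha1):
    """Format an error line that stays valid even without a sha1."""
    return b'Error: head is at %s\n' % fmt_sha1(sha1)


try:
    b'%s' % None          # bytes formatting rejects None on Python 3
    formatting_failed = False
except TypeError:
    formatting_failed = True
```

Wrapping the value keeps the error path itself from crashing, which matters because this code only runs when something has already gone wrong.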
Max Fuqua
750fe6d3e1 Resolve type error resulting from passing an int to b'%s' in python3 2020-02-29 14:55:15 -05:00
chrisjbillington
4071f720b0 Fix issue #203: Resolve stderr encoding issues
In Python 3, `sys.stderr.write()` requires unicode strings, and all
output on standard streams is UTF8 encoded. Therefore in the port to
Python 3, we `.decode()`d all strings that are used in `%` formatting of
strings to be printed to stderr.

However, in Python 2, `sys.stderr` accepts either bytestrings or unicode
strings, and:

- `%s` formatting of a bytestring with a unicode string, i.e.
  `"%s" % u"foo"`, results in a unicode string.
- Writing a unicode string to stderr/stdout uses that stream's encoding
- When the output of the process is being piped somewhere other than a
  terminal (as it is when called with pipes and shell redirection from
  hg-fast-export.sh), that encoding is None, which implies ASCII.
- This raises UnicodeEncodeError if the unicode strings passed to
  `stderr.write()` have non-ascii characters.

We cannot fix this problem simply by encoding UTF8 again before writing
to stderr on Python 2. This is because the *decoding* of filenames with
the UTF8 codec may fail - filenames may not even be valid UTF8 despite
this being the declared filesystem encoding.

We could `fsdecode()` filenames on Python 3, which would use the
`surrogateescape` error handler, but stderr does not use this error
handler for output, meaning we would just have to encode again (with the
same error handler) anyway. And Python 2 lacks the `surrogateescape`
error handler in any case - we would need to reimplement it just to do a
round-trip decode and encode for no reason.

This commit leaves filenames and other repository data as bytestrings,
and simply writes them to `sys.stderr.buffer` on Python 3 or
`sys.stderr` on Python 2 as-is, after `%` formatting with bytestring
literals. This avoids encoding issues of filenames altogether.

Other writing to stderr that does not involve repository data has been
left with "native" strings, i.e.
`sys.stderr.write("a string literal %s" % a_command_line_arg)`. These
will still fail on Python 3 if the user passes a non-UTF filename as a
command line argument or similar. This is acceptable IMHO - although
`hg-fast-export` may encounter invalid UTF8 in mercurial repositories,
it is not too much to impose that the user name their branch mapping
files etc with valid UTF8!
2020-02-19 12:18:00 -05:00
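The approach taken in 4071f720b0, formatting with bytestring literals and writing the result to the binary stream underneath stderr, can be sketched for Python 3 as follows. The `warn_bytes` helper and its `stream` parameter are hypothetical, and the Python 2 branch is omitted.

```python
import io
import sys


def warn_bytes(msg, stream=None):
    """Write a bytes message as-is to a binary stream, stderr by default."""
    if stream is None:
        stream = sys.stderr.buffer  # binary layer under sys.stderr
    stream.write(msg)
    stream.flush()


# A filename that is not valid UTF-8 (a lone latin-1 byte for 'é').
filename = b'caf\xe9.txt'

captured = io.BytesIO()
warn_bytes(b'Warning: skipping %s\n' % filename, captured)
```

Because the filename is never decoded, the invalid byte reaches the output untouched instead of raising an encoding error.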