When Plugins are used in a repository that contains largefiles,
the following exception is thrown as soon as the first largefile
is converted:
```
Traceback (most recent call last):
File "fast-export/hg-fast-export.py", line 728, in <module>
sys.exit(hg2git(options.repourl,m,options.marksfile,options.mappingfile,
File "fast-export/hg-fast-export.py", line 581, in hg2git
c=export_commit(ui,repo,rev,old_marks,max,c,authors,branchesmap,
File "fast-export/hg-fast-export.py", line 366, in export_commit
export_file_contents(ctx,man,modified,hgtags,fn_encoding,plugins)
File "fast-export/hg-fast-export.py", line 222, in export_file_contents
file_data = {'filename':filename,'file_ctx':file_ctx,'data':d}
UnboundLocalError: local variable 'file_ctx' referenced before assignment
```
This commit fixes the error by:
* initializing the file_ctx before the largefile handling takes place
* Providing a new `is_largefile` value for plugins so they can detect
if largefile handling was applied (and therefore the file_ctx
object may no longer be in sync with the git version of the file)
The commit "Warn if one of the marks, mapping, or heads files are
empty" (7224e420a7) mixed up the state and heads caches and reported
that the heads cache was empty if the state case was. Error found by
Shun-ichi Goto.
Closes#338
When the length logic for fast-import 'data' commands was updated in
4c10270 (Fix data handling, 2023-03-02), one branch was missed, so
commit messages now do not have a final LF appended in most cases. This
changed the longtime behavior, which had been consistent since the first
commit of hg2git, 9832035 (Initial import, 2007-03-06), and is expected
by some applications which compare against old conversions from
Mercurial.
The `file_data_filter` method should be called when files are deleted.
In this case the `data` and `file_ctx` keys map to None. This is so
that a filter which modifies file names can apply the same name
transformations before files are deleted.
This way export_commit is much simpler (already quite complex), and it's
easier to modify the logic.
No functional changes.
Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
There's no need to call repo[revnode] when repo[rev] works perfectly
fine.
And since we have the context already we can just do ctx.hex() instead
of hexlifying ourselves.
Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
The length should be exactly the same as the data, for example if the
data is "hello" only 5 characters should be written on the stream. Thus
it should always be `len(data)`, not `len(data)+1` as it currently is in
some places.
Since the first commit of hg2git.py there was a wtf comment, presumably
Rocco was confused about this common discrepancy.
We can shuffle the logic around by adding '\n' to the data, and removing
+1 to the length.
Also, the data should be written without a newline (wr_no_nl).
Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
In `process_unicode_escape_sequences()`, any backslash escape sequences
in the original string are escaped upon the first
`.encode('unicode-escape')` and therefore round-trip the sequence of
`.encode('unicode-escape').decode('unicode-escape')`.
That is not what we want - we want these sequences to be passed-through
the `.encode` unchanged, so that they will be converted to the
character they represent upon `.decode()`.
This patch changes the `.encode()` step to pass through any ascii
characters unchanged, only escaping non-ascii characters. This ensures
any existing backslash escape sequences will be interpreted as the
character they represent upon `.decode()`.
Since Python 3.7 the re module warns for syntax which could, in the
future, be misparsed as a nested set. Avoid this by escaping the
literal `[` we search for in the regexp.
Reported by Monte Davidoff @mndavidoff
Closes#269.
If fast-export was asked to export a Mercurial branch to Git and a
branch of the same name already existed in the Git repo but it was not
created by fast export, fast-export would crash while trying to format
an error message claiming that the destination branch was modified
behind its back.
This patch extends fast-export to detect the situation above and give
a proper error message which hopefully is less confusing to the user.
Credits for discovering the original crash goes to Shun-ichi Goto
<gotoh@taiyo.co.jp>.
Closes: #269.
Keys and values in the state cache are byte strings, therefore a
lookup of 'tip' will always fail. The failure makes the conversion
start over from the beginning, but as fast-export is deterministic the
results are the same, just very inefficient. The bug has existed since
the port to Python 3.
This patch switches the 'tip' lookup to use a byte string which should
make incremental conversions restart at the last converted commit. As
'x' == b'x' in Python 2, this should be a backwards compatible change.
Bug reported and fix suggested by Tomas Kolda.
Fixes#258.
Plugins have since they were introduced been able to modify the author
of a commit, but not the committer. This patch adds the necessary
support for allowing them to also modify the committer.
This option allows the user to ignore only unnamed heads (compared to --force
which ignores all non-fatal issues). The intended use is for a future plugin
converting unnamed heads to named branches.
The port to Python 3 in b961f146 changed `repo.branchmap().iteritems()`
to use `.items()` instead. However, the object returned by mercurial
isn't a dictionary and its `.items()` method was only introduced (as an
alias for `iteritems`) in hg 5.1. `iteritems()` still exists, so let's
keep using it for now to retain compatibility with hg < 5.1.
In the verification phase, fast-export falsely expects that both hg
and git subrepositories should have the appropriate line in the
subrepo-map file. The case is, that only hg subrepos need a line in
subrepo-map that references a converted subrepo, while git
subrepositories do not.
We were previously processing entries in mapping files (when
`--mappings-are-raw` is not given) with
`.decode('unicode_escape').encode('utf8')` to replace backslash escape
sequences in bytestrings with the utf-8 encoded characters they
represent. However, it turns out that `.decode
('unicode_escape')` assumes latin-1 encoding if it encounters non-ascii
bytes: https://bugs.python.org/issue21331. So this gave incorrect
results if non-ascii utf8 data was present in the mapping.
To fix this, we now add an extra layer of `.decode('utf8').encode
('unicode-escape')` in order to convert any non-ascii characters into
their backslash escape sequences. Then the subsequent
`.decode('unicode_escape')` only encounters ascii characters and gives
correct results.
Mercurial internally stores (most) filepaths using forward slashes, and
returns them as such from its Python API, even on Windows.
So the splitting up of filepaths with `os.path.sep` was incorrect,
resulting in `.git` files (those within a subdirectory, anyway)
not being ignored on Windows as intended. Splitting on `b'/'` regardless
of OS fixes this.
On Python 3, `b'%s' % None` fails with a TypeError. In verify_heads,
an error message prints the sha1 of a git commit, but that sha1
can be None.
This commit instead prints `b'<None>'` if sha1 is None.
In Python 3, `sys.stderr.write()` requires unicode strings, and all
output on standard streams is UTF8 encoded. Therefore in the port to
Python 3, we `.decode()`d all strings that are used in `%` formatting of
strings to be printed to stderr.
However, in Python 2, `sys.stderr` accepts either bytestrings or unicode
strings, and:
- `%s` formatting of a bytestring with a unicode string, i.e `"%s" %
u"foo"` results in a unicode string.
- Writing a unicode string to stderr/stdout uses that stream's encoding
- When the output of the process is being piped somewhere other than a
terminal (as it is when called with pipes and shell redirection from
hg-fast-export.sh), that encoding is None, which implies ASCII.
- This raises UnicodeEncodeError if the unicode strings passed to
`stderr.write()` have non-ascii characters.
We cannot fix this problem simply by encoding UTF8 again before writing
to stderr on Python 2. This is because the *decoding* of filenames with
the UTF8 codec may fail - filenames may not even be valid UTF8 desite
this being the declared filesystem encoding.
We could `fsdecode()` filenames on Python 3, which would use the
`surrogateescape` error handler, but stderr does not use this error
handler for output, meaning we would just have to encode again (with the
same error handler) anyway. And Python 2 lacks the `surrogateescape`
error handler in any case - we would need to reimplement it just to do a
round-trip decode and encode for no reason.
This commit leaves filenames and other repository data as bytestrings,
and simply writes them to `sys.stderr.buffer` on Python 3 or
`sys.stderr` on Python 2 as-is, after `%` formatting with bytestring
literals. This avoids encoding issues of filenames altogether.
Other writing to stderr that does not involve repository data has been
left with "native" strings, i.e.
`sys.stderr.write("a string literal %s" % a_command_line_arg)`. These
will still fail on Python 3 if the user passes a non-UTF filename as a
command line argument or similar. This is acceptable IMHO - although
`hg-fast-export` may encounter invalid UTF8 in mercurial repositories,
it is not too much to impose that the user name their branch mapping
files etc with valid UTF8!